Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Entheogen posted:

I have successfully implemented the 3D slicing technique; however, it appears to be rather slow. The way I calculate the inverse matrix also doesn't appear to be much of a factor here, as I only do it once and then send it to the vertex shader, which actually multiplies it with the tex coordinates. I think the limit here is imposed by my card's texture fill rate and the blending that it has to do between all slices. Also there appear to be some artifacts due to slicing. They appear as lines that criss-cross the volume.

I was wondering, however: do you think combining this with a technique I was doing earlier could help both the quality and the speed? What I am thinking about is generating space-filling cubes again, but this time, instead of giving them colors, I could give them 3D texture coordinates. I am not sure how much faster it could be than what I am doing now, but it could possibly increase the visual quality. I will try this and report back.

Here is the screen shot:


There are 1000 slices here, and I am using my Gaussian filter to isolate a certain data range as well as provide false coloring. You can see the artifact lines criss-crossing the volume, though.

OK, it seems that these artifact lines only appear in the Gaussian shader, and not the linear one. Here is the source code for my linear fragment shader:

code:
uniform float scale_factor;
uniform sampler3D tex3D;

void main(void)
{
   float v = texture3D(tex3D, gl_TexCoord[0].stp).r;
   gl_FragColor = vec4( scale_factor,1,1,v*scale_factor);
}
and here is the Gaussian one:

code:

uniform sampler3D tex3D;

uniform float scale_factor; 
uniform float gauss_a; 
uniform float gauss_b; 
uniform float gauss_c; 

void main(void)
{
   float v = texture3D(tex3D, gl_TexCoord[0].stp).r;
   float exponent = v - gauss_b;
   exponent *= exponent;
   exponent /= -2 *( gauss_c * gauss_c );
   float nv = scale_factor * gauss_a * exp( exponent );
   gl_FragColor = vec4( 1-v, v,1, nv);
}
How could I still use a normal distribution function but avoid these artifact lines?

That artifact pattern seems weird; I expect that it has something to do with the depth resolution of your texture and the number of slices you do. Still, if you have trilinear filtering working properly, I don't quite see why you should be seeing those artifacts (although I suppose it really depends on what your data looks like). Either way, it's definitely a sampling error of some sort.

I'd agree that you should "compress" multiple slices into one by sampling your volume texture multiple times per shape. Instead of just sampling "TexCoord[0].stp", grab "TexCoord[0].stp + float3(0, 0, stepSize*n)" as well (for n steps). Obviously, you'd want that offset transformed into the proper space for your 3D texture as well.
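
To make that concrete, here's a hedged sketch of a linear-shader variant that folds a few sub-slices into one pass; slice_step is a made-up uniform, and the offset is applied in texture space, so your inverse-matrix transform would need to be folded in wherever you do that now:

code:
uniform float scale_factor;
uniform sampler3D tex3D;
uniform float slice_step;   // made up: spacing between sub-slices, in texture space

const int NUM_STEPS = 4;    // how many sub-slices get folded into this one pass

void main(void)
{
   float accum = 0.0;
   for (int i = 0; i < NUM_STEPS; i++)
   {
      // march along the slicing axis in texture space; in practice this offset
      // needs the same inverse-matrix transform as the base coordinate
      vec3 coord = gl_TexCoord[0].stp + vec3(0.0, 0.0, slice_step * float(i));
      accum += texture3D(tex3D, coord).r;
   }
   float v = accum / float(NUM_STEPS);
   gl_FragColor = vec4(scale_factor, 1.0, 1.0, v * scale_factor);
}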

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

pianoSpleen posted:

As for the threading thing, it's not really that surprising. Just because your program is single-threaded, doesn't mean the display driver it's calling is. If you do something weird like this it's not entirely unexpected that the driver will balance the load between cores to increase performance.

In fact, depending on what hardware/driver you are using, I can almost guarantee that this is the case.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

heeen posted:

Why not make it all triangles and save yourself a state change and an additional vertex and index buffer?

Yes, you're almost certainly better off with all triangles (preferably tri-strips).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

Can anyone give me some guidelines on geometry shader performance? (a paper/article/tutorial for instance)

I am using GSs to output GL_POINTS. If i output 4 or less points per shader call, performance is fine. Anything more than that, and i get 1-4 fps (down from > 75). I have a feeling that the shaders MIGHT be running in software for some reason, but i cant figure out what i am doing wrong...Any ideas?

The problem you are running into is due to the somewhat short-sighted way the Geometry Shader was fitted into the pipeline on a lot of hardware. Essentially, if you're using the geometry shader to do expansion in the data stream, you run into memory contention problems. This is due to the fact that the primitives (triangles, points, etc.) need to be rasterized in issue order, which means that the memory buffer coming out of the geometry shader stage has to match the order of primitives coming into the geometry shader (i.e. primitive A, or any shapes derived from primitive A, must all come before primitive B). If you're doing expansion in the shader, this means that processing of primitive B can't start (or at least, the results can't be output) until primitive A is finished. Combine this with the fact that the clipping/rasterizer stage now has to process 4 times faster than the vertex shading stage and that the buffers were optimally sized to a 1:1 ratio, and you have the potential for a lot of bottlenecks.

For what it's worth, you *might* want to try doing a two-stage approach: shading your vertices and using the GS to expand, then feeding the output to a Stream-Out buffer, then re-rendering that with a rudimentary vertex shader. Because you're streaming straight to memory, you might not run into the buffer limitation issues. I haven't tested this myself though, and the extra processing + slower memory speed might negate any gains you get.

This, incidentally, is why tessellation in the DX10 Geometry Shader is generally discouraged.

e: To be taken with a grain of salt. Looking at the post again, 75fps to 4fps seems like a very dramatic slowdown for this sort of thing. It could actually be possible that you're running into software mode, but that seems unlikely based on your description.

Hubis fucked around with this message at 18:25 on Sep 26, 2008

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

Thanks for the input!

I dont suppose there is a way to disable that? What i am doing does not actually require in-order rasterization, i use additive blend to accumulate the fragments from all the generated geometry, so order does not matter.

I am just weirded out that generating a display list with my high-quality point mesh and sending it over the card is a lot faster than generating a low quality mesh display list and expanding it in the shader... (for instance, i want to end up with an 800 * 800 point mesh, so i figured id send an 80*80 mesh, then generate further points inside the GS).

Unfortunately, no. Adding the ability to disable/relax that requirement has been suggested as a workaround to this, but as it stands, it's an ironclad rule of the API.

Remember, you're working with a pipeline, not a series of iterative stages being run on the whole data set at a time. Thus, if data transfer and vertex processing are not the bottlenecks in your pipeline, optimizing them isn't going to give you any performance benefit. Unless you only have one job going through the entire pipeline at a time, your render time is going to be only as fast as the single slowest stage of the pipeline (barring some strange edge cases).

If you were using DirectX, I'd say run it through nvPerfHUD and see where your bottlenecks are, then optimize from there. From what you describe, it sounds like Blending/Raster Ops are your bottleneck, not Data Transfer/Input Assembly/Vertex Processing.

shodanjr_gr posted:

To give you a measure, lets assume that i send over a 200 * 200 grid display list to the card and i generate 1 point in my GS. This runs at < 75 fps and all is well. If i keep the display list constant, and i generate 4 points in my GS, i drop to 25 FPS! If i move up to 9 points, i drop to 12 FPS. 16 points brings me into single digits. At 16 points per grid point, i get a total of 640000 points. Now, if i send an 800 * 800 display list for rendering, and only generate a single point (for the same number of total points), i get at least 15 FPS. So for the same amount of geometry, using the GS to expand gives me a 75% reduction to frame rate compared to a display list...

ah ok. Then yes, this sounds like you're running into the issue I described above.

Hubis fucked around with this message at 20:41 on Sep 26, 2008

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

I'll look into your suggestion to see if i can get a buff in FPSes...i really dont want to can this idea, since its been producing very nice results visually.

Can you point me to the right direction for Stream-out buffers?
Now that I think about it, it's a DX10-class feature, so it probably won't be supported outside of DX10 (or, in OpenGL, via extension) on anything less than DX10-class (GeForce 8xxx) hardware. However, if you want to give it a try, you want to look at the NV_transform_feedback extension.
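
For what it's worth, here's a rough, hedged sketch of the shape of that two-pass approach, written against the later core-style transform feedback entry points (the NV extension's calls are analogous); streamOutVbo and the point counts are placeholder names:

code:
// Pass 1: vertex + geometry shader expand the coarse grid; capture the expanded
// points into streamOutVbo instead of rasterizing them. The program must have had
// glTransformFeedbackVaryings() (or the NV equivalent) set up before linking.
glEnable(GL_RASTERIZER_DISCARD);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, streamOutVbo);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, coarseGridPointCount);     // e.g. the 80x80 grid
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);

// Pass 2: re-render the captured points with a trivial vertex shader.
// expandedPointCount would come from a GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query.
glBindBuffer(GL_ARRAY_BUFFER, streamOutVbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, 0);                   // assumes xyz positions were captured
glDrawArrays(GL_POINTS, 0, expandedPointCount);
glDisableClientState(GL_VERTEX_ARRAY);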



shodanjr_gr posted:

edit:

Is there a chance this is an nvidia only problem? (i assume there are pretty large architectural differences between nVidia and ATi GPUs).

I can't speak to AMD internals, but it's my understanding that they run into the same problem. The only difference would be the size of the post-GS buffer, which will determine where you hit those falloffs and how much they'll hurt performance.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

Thanks for the replies Hubis. Ill check out the spec you linked.

Final question, is there a way to query the driver to see if it's running in software mode?

Not that I'm aware of.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

HauntedRobot posted:

Kind of a vague question, sorry.

Lets say I'm working on an outdoor first person game in OpenGL. I want a large view distance, so I'm assuming I'm going to be drawing very large, very simple geometry on the horizon, and small complicated geometry close up. For whatever reason, some way of cheating with a 2 pass skybox rendering method isn't going to work.

What are the performance considerations of doing this? For instance, does it poo poo all over my z buffer precision if in the same scene I'm drawing a 4 poly pyramid on the horizon that's hundreds of units tall and doing detail work close up? Also in that kind of situation, what's an acceptable scale for our coordinates? Is it worth spending time worrying about whether making my map extents minfloat to maxfloat rather than something arbitrary like +- 1024.

You're going to run into significant z-buffer precision problems leading to Z-fighting for things that are near one another (within one exponential precision value from one another) unless they are drawn in strict back-to-front order, including sub-draw call triangle order.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Mithaldu posted:

Due to finally being able to compile OpenGL 0.57 in CPAN, i have now access to a workable example of vertex arrays in Perl. If i look at this correctly, it means that it can draw a single triangle in exactly one call, as opposed to the 6 needed in immediate mode. This would mean a massive speed-up since Perl has a horrible penalty on sub calls.

However i'll still need to use display lists, since 180 calls per frame is a lot nicer than 46000.

What are good practices when combining display lists and vertex arrays and what should i pay attention to?

there are no real concerns with combining them, as vertex arrays are very commonly used and self-contained. Just don't go overwriting buffers the driver is still using and you should be ok. Also, I'm still confused as to why you're issuing only one triangle at a time -- are you really completely changing state every draw?

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Mithaldu posted:

Got a question about z-fighting issues. I'm rendering layers of roughly 80*80 objects, with roughly 20-40 layers stacked atop each other at any time. There is no computationally non-expensive way to cut out invisible geometry that i am aware of, so i pretty much have to render everything at all times. This leads to the z-buffer getting filled up pretty fast, resulting in rather ugly occasions of polys clipping through other polys.

I've already restricted the near and far clipping planes as much as possible so the geometry has the full z-buffer to play with.

Would it help the clipping issues to render the geometry (when looking at it from the top) by cycling through the z-layers bottom-up, drawing everything in them and calling glClear( GL_DEPTH_BUFFER_BIT ); inbetween every z-layer? If so, would it also help to do a full reset of the projection like this, or would that be unnecessary extra work?
code:
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
    
//snip calculation of clipping planes
    
gluPerspective( $cam_angle, $width / $height, $dist_min, $dist_max );

glMatrixMode(GL_MODELVIEW);

You could try disabling Z culling altogether, as it seems like it's very easy for you to render from nearest-to-furthest order. To save shading work, you can use the stencil functionality to write the current layer number to the stencil buffer, and to only output a given pixel if the current layer has nothing written to it already (i.e. == 0). Rendering top to bottom in this manner and using the stencil buffer to avoid overdraw should work.

That being said, I'm kind of surprised you're having trouble with Z precision like that. Your z-clear trick would probably work, though I'd recommend rendering from top down and using the stencil technique mentioned above as well if you go that route.
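
A hedged sketch of the stencil setup in plain GL calls (the layer loop and drawLayer are placeholders); the stencil test rejects any pixel a higher layer has already claimed, and the depth clear between layers is your trick from above:

code:
glEnable(GL_STENCIL_TEST);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

// draw layers from the top (nearest the camera) down
for (int layer = topLayer; layer >= bottomLayer; --layer)
{
    glClear(GL_DEPTH_BUFFER_BIT);                // reset depth between layers
    glStencilFunc(GL_EQUAL, 0, 0xFF);            // only pass where nothing is drawn yet
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);      // mark covered pixels so lower layers skip them
    drawLayer(layer);                            // placeholder for your per-layer display lists
}

glDisable(GL_STENCIL_TEST);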

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Mithaldu posted:

Disabling the depth buffer is not an option, as the z-culling needs to be kept functional for each z-layer. Sounds like a pretty neat approach though. (Although i'm confused by what you mean with shading work.)

To make things a bit more visual, some screenshots. The first one shows a normal view. You are looking at 5x5 blocks of 16x16 units each. Each unit can be a number of terrain features, as well as a building (red) or a number of items (blue). Currently i'm rendering all items, even if that means they overlap each other, as sorting out already rendered ones causes a performance impact on the cpu side that slows poo poo down. As such, there's probably 3000-6000 items rendered there (visible and underground), all in all coming together to a fuckton of polys.



The landscape blocks are rendered as one display-list each, with the coordinates for all vertices pre-calculated. The buildings and items are rendered as individual 0,0,0-centered display lists that get moved into place by being wrapped in gltranslate commands.

Additionally, each block has not only the top-faces of each landscape unit rendered, but also the no/so/we/ea faces of the edge units, since leaving them out would leave holes in the geometry. These are the first problem factor that i noticed: When zooming in, they begin to clip through the top faces.

Additionally the red building blocks are rarely problems and manage, again when zoomed in, to clip through the top faces *entirely*. Lastly, when just doing screenshots, i noticed that sometimes the stones on the ground that are part of the landscape display lists clip through the blue items.

The dimensions are such that each landscape unit is exactly 1x1x1 units big, speaking in floating point vertex coordinates. (Would it maybe help if i increased that?) Currently the clipping planes are being done by defining a bounding box around all visible content (since it's a neat rectangle) and selecting the two points that are furthest and closest from the screen-plane, when projected along the camera vector and using their distances as parameters.

I am using gluLookAt, with the target being the turquoise wireframe. If the camera is moved towards the target or away from it the clipping planes get adjusted to match the distances. If the near plane falls under 1.0, i adjust it to stay at 1.0.

And that's where my problem comes in. Looking at it all from afar is good and works, but when i zoom in by moving the camera closer, this happens:



Note how the rocks in the bottom left corner clip in front of the blue things, which should be straight above them. Also note the side-faces at the block edge clipping through the top faces in the line in the bottom-right quarter as well as in the top middle on the dark ground.

I mitigated it a bit by switching from zooming by moving the camera to zooming by changing the fov, but even that doesn't fix it completely (and things tend to jerk and jump around a lot when moving the camera):



:words:

It's a dwarf fortress clone, by chance? :)

By shading cost, I meant the cost of doing all the texture lookups, combinations, and copying/blending to the framebuffer. If you're rendering 20 overlapping objects, your GPU is basically doing 20x more work than it has to ("overshading"). Of course, I just remembered that you're doing this in Perl, so it's almost certainly CPU limited :v:

OK, so you've got this world-space box enveloping the entire scene, and a camera position (which is either external or possibly internal, if you're zoomed in). You've also got Z-Near and Z-Far, which you say you're clamping to max(1.0f, <nearest point in the scene>) and <furthest point in the scene>, respectively. Do I understand correctly? This seems to be what I'd advise.

What I find strange is that the Z-sorting issues get worse the closer you get to the objects. Usually, problems arise when you get further away from the camera, and the distance between overlapping objects becomes smaller than the precision at that distance (which, in turn, gets better closer to the camera). It almost seems like the scene is doing the opposite of what it should be doing...

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Mithaldu posted:

It's not a clone, it's this: http://code.google.com/p/dwarvis/ :) If you have the game and a decently complex map you can try it out for yourself with the version that's for download there.

And thanks for explaining the shading thing, yeah, OpenGL-wise i have massive amounts of performance left, even on my rather dinky quadro 140 mobile. CPU is the big bottle-neck, which is why i don't bother to cut out invisible stuff.


Very interesting! I'll have to give it a whirl ;)

From what I've been seeing, you're taking a very naive (though valid) approach to rendering -- I presume that each tile is a separate draw call, along with each creature, object etc? You're definitely going to introduce a lot of overhead, just in driver command stream alone, not even considering the function call overhead of perl. The way many games address these sorts of issues is by 'batching' draw calls together, wherever possible. Once you get to a place where you've got enough functionality to worry about performance, I can give some more advice if you'd like.

quote:

Regarding your understanding of what i'm doing: Exactly, thanks for summarizing it so well. :)

As for your last paragraph: You're suddenly disregarding your previous understanding there. :D The traditional "it gets worse in the distance" is only true if the clip planes are fixed at, for example, 1 and 1000. However, in my case the clip planes are mobile and can wander from "1-150" to "1001-1150", which makes the distance thing irrelevant.

My theory on why it happens when zoomed in is that OpenGL has an easier time mapping objects between "1001-1150" than between "1-150", which is why i tried to fix it with zooming by adjusting FoV. On the other hand, it might simply be that the problems are universal, but too small to show up when viewed from a normal distance. :iiam:

Nonsense :) The Z-near and Z-far coordinates help you place the positions of the near and far planes of the view frustum, respectively, and in turn determine components of your projection transform matrix. This matrix is used to transform your vertices from "view coordinates" (which the Modelview matrix produces) to "clipping coordinates", which are in the range [0, 1] on all axes. The X- and Y-coordinates represent the screen position from (left, bottom) to (right, top), while the Z-coordinate represents the vertex's position between the near and far clip planes as a value between 0 and 1. You can think of the View-to-Clip transform as warping and re-scaling the dimensions of the pyramidal Frustum to a unit-cube.

These "clip" coordinates are what the hardware uses to do clipping to the edge of the screen, and what is fed to the Z-test unit as depth coordinates. Thus, [1, 150] and [1001, 1150] should be the same to the hardware, since they both get converted to [0, 1] in the vertex shader. Furthermore, if we're actually talking about being "outside" the scene versus being "inside" the scene, then the numbers would be more like outside=[1000, 1150] and inside=[1, 75] because your camera will by definition be closer to the far plane than the boundery of the scene is. This means that not only will these offending objects (like the blue item in the near field) be moving from the low precision far end of the depth buffer to the higher precision near-end, but that there will be more precision per unit in general! Of course, it's been a while since I worked with OpenGL, so I might be mis-thinking about something.

The OpenGL FAQ on Depth has some interesting things to say on the subject. In particular, they suggest doing something along the lines of your "Clear the depth between layers" approach for very large depth distances; however, I still think something else is going awry in your code, because again this seems like the opposite problem you should be having. Can you try just clamping your planes to something like [1, 1000] and moving around the world to see if you're still having issues?

Hubis fucked around with this message at 22:36 on Oct 25, 2008

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Jo posted:

You can turn off all the lights in a room, glow in the dark paint still lights up.

Sounds like you're adjusting a global ambient light instead of the ambient light for the individual materials.

(bad analogy; that's emissive material)

Since indirect illumination is really complicated to actually simulate properly (as opposed to direct illumination, which has a simple analytical solution once you determine visibility), the 'ambient' component is used and is basically a hack to approximate secondary and greater light bounces off of other surfaces in the scene (as opposed to coming directly from the light source). In most cases, ambient light should just be treated exactly as direct illumination (using the same materials, etc) except that it is completely diffuse and directionless -- i.e. instead of N.L*cLight*cDiffuse+R.L*cLight*cSpecular it would just be cLight*cAmbient. Glow in the dark paint is emissive, which means light that is emitted directly from the surface; it can be whatever color you want.
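
As a hedged sketch (made-up uniforms standing in for your material/light setup), the per-fragment sum looks something like this, with the emissive term being the glow-in-the-dark part:

code:
uniform vec3 cLight;      // light color/intensity
uniform vec3 cAmbient;    // material ambient response (directionless bounce light)
uniform vec3 cDiffuse;    // material diffuse color
uniform vec3 cEmissive;   // light emitted by the surface itself (glow-in-the-dark paint)

varying vec3 vNormal;     // interpolated surface normal
varying vec3 vLightDir;   // direction to the light

void main(void)
{
    vec3 N = normalize(vNormal);
    vec3 L = normalize(vLightDir);
    float NdotL = max(dot(N, L), 0.0);

    vec3 direct  = NdotL * cLight * cDiffuse;   // direct illumination (specular omitted)
    vec3 ambient = cLight * cAmbient;           // completely diffuse, directionless
    gl_FragColor = vec4(direct + ambient + cEmissive, 1.0);
}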

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

Well, I wont lie to you, I was thinking about it, but I seriously doubt that the iphone has enough omph to do even rudimentary real-time volume rendering. The CPU is clocked at 400mhz and the graphics chip is not programmable (so stuff like ray marching is out of the question). So I'm developing it using ObjC/Cocoa/Glut on my white macbook with a 9400m.

A basic raytracer should be much more feasible.

I have high confidence you could get this working on an NVIDIA Ion system, which is pretty close to that. What I'd really be interested in is if you could get it working on a Tegra, but I don't know how you'd get a test platform for that.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Jo posted:

I'm just surprised. I always shrugged GLSL off as nothing more than a way of applying fancy textures. That's what 'shader' meant to me. To see that it does lighting and geometric transforms is a very strange and eye-opening realization.

'vertex program' and 'fragment program' are generally more correct than 'shader', especially nowadays. Even in the 'fixed function' pipeline, very little is actually in fixed function hardware.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Stanlo posted:

Indices reduce the size, both on the file and on the GPU.

Most of the time for personal projects when perf isn't an issue you can just cheese your way around and draw everything as unindexed triangle lists.

If you want the perf and space boost, a simple optimization would be:

code:
create a unique list of vertices

create empty list of indices
for each triangle
    for each triangle vertex
        put index that vertex appears in unique list of vertices into list of indices
Now, ideally, each vertex needs to be processed only once.

Space (or more correctly, bandwidth) is a factor in that indices allow adjacent triangles using the same vertices to only load the vertex data once (using a pre-transform cache); however, a bigger benefit of using indices is that it allows the GPU to cache the results of the transform/vertex shader stage into a post-transform cache as well, thus saving the cost of the vertex processing.

So yeah, doesn't matter if you're not GPU performance bound, but pretty important if you are.
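
For illustration, a hedged C++ version of that pseudocode with a position-only vertex (a real vertex format would also fold normals/UVs into the key):

code:
#include <map>
#include <vector>

struct Vertex { float x, y, z; };

// ordering so Vertex can be used as a map key
bool operator<(const Vertex& a, const Vertex& b)
{
    if (a.x != b.x) return a.x < b.x;
    if (a.y != b.y) return a.y < b.y;
    return a.z < b.z;
}

// Collapse a flat triangle list (3 vertices per triangle, duplicates included)
// into a unique vertex list plus an index list.
void buildIndexedMesh(const std::vector<Vertex>& triangleList,
                      std::vector<Vertex>& uniqueVerts,
                      std::vector<unsigned int>& indices)
{
    std::map<Vertex, unsigned int> lookup;
    for (const Vertex& v : triangleList)
    {
        auto it = lookup.find(v);
        if (it == lookup.end())
        {
            unsigned int idx = (unsigned int)uniqueVerts.size();
            lookup[v] = idx;
            uniqueVerts.push_back(v);
            indices.push_back(idx);
        }
        else
        {
            indices.push_back(it->second);
        }
    }
}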

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

not a dinosaur posted:

Can I seriously not tile textures from an atlas in OpenGL

not without doing some pixel shader tricks, no.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

not a dinosaur posted:

I've never worked with GLSL before but I spent some time reading about it tonight and it sounds pretty interesting. I'm working on a little 2D game that currently uses a bunch of textured quads and renders them using VBOs. The quads are never rotated and never change size, but the texture coordinates change fairly often for animated sprites and that kind of thing.

Is it possible to store texture coordinates and the height/width of the quad in the normal or something that I don't use, and instead render everything as a bunch of point sprites? I'm thinking I can set up the texture coordinates and quad size in a vertex/fragment shader.. Am I way off base? Am I likely to see much increase in throughput by doing so?

Yep. In fact, you should have a bunch of TEXCOORDn attributes (for n = [0, 7) I think) which are commonly used to pack those things in. The performance increase you'd see would be almost entirely dependent upon where your current bottlenecks are; however, I'd be willing to say that you're likely CPU/driver-interface bottlenecked, in which case doing fewer, less frequent API calls and offloading more work to the shaders would almost certainly be a win.
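
As a hedged sketch of the packing (attribute names invented, and you'd need the point-sprite state -- GL_VERTEX_PROGRAM_POINT_SIZE and, in older GL, GL_POINT_SPRITE -- enabled for gl_PointSize/gl_PointCoord to behave):

code:
// --- vertex shader ---
attribute vec4 spriteRect;   // invented: atlas rect as (u0, v0, width, height)
attribute float spriteSize;  // invented: on-screen sprite size in pixels
varying vec4 vRect;

void main(void)
{
    vRect = spriteRect;
    gl_PointSize = spriteSize;
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

// --- fragment shader ---
uniform sampler2D atlas;
varying vec4 vRect;

void main(void)
{
    // gl_PointCoord runs 0..1 across the point sprite; remap it into the atlas rect
    vec2 uv = vRect.xy + gl_PointCoord * vRect.zw;
    gl_FragColor = texture2D(atlas, uv);
}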

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Contero posted:

I'm trying to get rid of this annoying diamond artifact:



I believe it's mach banding, but I could be mistaken. It's a heightmap where the worldspace normal is stored in a varying vec3 and then the diffuse color is calculated in the frag shader.

Is there a straightforward way of getting rid of this problem? It seems like it should be fairly common.

Can you post a wireframe screenshot? I have a suspicion of what's going on.

The most common way of converting the height map to geometry is by making squares out of each four adjacent vertices, then bisecting the squares to form two triangles. However, if you do this, sometimes the bisection direction will run counter to the actual contour of the surface. What you really want to do is calculate the normals of the two triangles in each bisection direction (top-left to bottom-right, and bottom-left to top-right) and find the dot product between them. Then, use the bisection where the dot product is the least (i.e. where the normals are most different). This will ensure that your geometry matches the contours of the underlying terrain as closely as possible.
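
A hedged sketch of that test for a single heightmap cell (the vector helpers and corner naming are mine, not from any particular library):

code:
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(const Vec3& a, const Vec3& b)
{
    Vec3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
    return r;
}
static Vec3 normalize(const Vec3& a)
{
    float len = std::sqrt(dot(a, a));
    Vec3 r = { a.x / len, a.y / len, a.z / len };
    return r;
}
static Vec3 triNormal(const Vec3& a, const Vec3& b, const Vec3& c)
{
    return normalize(cross(sub(b, a), sub(c, a)));
}

// Given the four corners of a heightmap cell (tl, tr, bl, br), return true if the
// cell should be split along the tl-br diagonal, false for the bl-tr diagonal.
bool splitAlongTlBr(const Vec3& tl, const Vec3& tr, const Vec3& bl, const Vec3& br)
{
    // normals of the two triangles produced by each candidate split
    float tlbrAgreement = dot(triNormal(tl, bl, br), triNormal(tl, br, tr));
    float bltrAgreement = dot(triNormal(bl, br, tr), triNormal(bl, tr, tl));
    // pick the split whose normals are *most different* (smallest dot product)
    return tlbrAgreement <= bltrAgreement;
}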

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Contero posted:

Unfortunately this is from Nvidia's ocean FFT example, and the map is going to change every frame. Changing up my vertex ordering every frame might be a little complicated / slow.

Ohhh... for some reason I thought this was a static heightmap. Hmm.

Could you just use normals from the normal map instead of using interpolated triangle normals? That would free your lighting model from the underlying geometry division.

OneEightHundred posted:

No, order by (X+Y), meaning it'll alternate on both axes.

Something like this:


You'll still see the artifacts (the underlying problem of the quads being non-planar will remain) but it will reduce the visibility of it.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Generally speaking, shader changes are more expensive but nearly everything else causes a full re-upload of the state on modern hardware.

The real solution is to do things that let you avoid render state changes completely. Merging more things into single draw calls, and overarching stuff like deferred lighting.

Actually, very few things cause stalls/full state re-uploads nowadays. State management used to be a bigger issue on older hardware, but nowadays there's enough transistors hanging around that the hardware is able to version state changes and "bubble" them through the pipeline with the rendering workload. Even changing render targets doesn't matter as much anymore, unless you're trying to re-bind the texture for reading somewhere else in the pipeline (in which case, the hardware will pick up that it's trying to be used in two different ways, and stall until the changes are committed).

However, state changes aren't necessarily free. Driver overhead is usually a big culprit -- the DirectX stack is pretty deep, and there are a lot of opportunities for cache misses to occur when you're binding resource views in various parts of memory that need to be made resident, have their ref-counts checked, etc. This turns a semantically simple operation like pushing some textures into the command buffer into hundreds of idle cycles. This is actually the motivation behind the "Bindless Graphics" extensions in OpenGL. Instead of having active state, you just put your texture headers/transform matrices/whatever into a big buffer in memory, give the base pointers to the GPU, and have the shaders fetch what they need from where they need it on the fly.

Another thing to concern yourself with is synchronization points. Any time you create resources (and sometimes even when you release them) the CPU and GPU have to sync, idling whichever one was ahead. If you're not careful with how you're mapping your constant buffers (for example) or something like dynamic vertex data, you can end up injecting a lot of unnecessary overhead.

Obviously this will vary a little with the under-the-hood hardware and driver implementations of various vendors (ATI seemed to indicate at GDC that changing the Tessellation state in DX11 might be slightly expensive for them, while NVIDIA doesn't seem to have any penalty for example) but this should all be accurate "to first order".

Of course, if you're making games that target hardware pre-2008ish, they might not have as robust state versioning functionality, so this will still matter. And if you're using D3D9 (OneEightHundred is definitely right about the improvements in D3D10/D3D11) you definitely still need to be careful. However, in modern APIs/hardware, it's becoming less of a concern. Pay attention if you're mixing Compute (DirectCompute/OpenCL/CUDA) and Graphics (Direct3D or OpenGL), though.

Hubis fucked around with this message at 03:47 on Apr 29, 2010

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...
Yeah, I *believe* so. The hardware numbers all get messy around G80. Definitely on Ion2.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

I'm trying to work out how to draw terrain as a crossover pattern of triangles, like so:

code:

+--+--+--+--+-- etc
| /|\ | /|\ |
|/ | \|/ | \|
+--+--+--+--+
|\ | /|\ | /|
| \|/ | \|/ |
+--+--+--+--+
|
etc

As one long triangle strip. I think I have a basic method using degenerate triangles, but I'm curious if the extra triangles get optimized out by modern GPUs.

I'm also wondering if there's a better method, I worked out that I'd have 6 extra triangles for every four tiles, but my vertices would go from 24 to 16. (Plus a bit more per row, but that's probably set.)

Are the extra degenerate triangles generally going to be less costly than the extra vertices from having the triangles be specified as TRIANGLES rather than TRIANGLE_STRIP? Let's assume the terrain mesh is fairly beefy, somewhere on the order of a couple thousand height points square.

Is there any reason you have to use that pattern, as opposed to something like

code:
+--+--+--+--+-- etc
| /| /| /| /|
|/ |/ |/ |/ |
+--+--+--+--+--
| /| /| /| /|
|/ |/ |/ |/ |
+--+--+--+--+--
|
etc
?

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Using anything other than crossover gives you "diamond" or "stripe" artifacting caused by diagonals running parallel to their neighbors and verts having mixed 45/90 degree angles with other verts instead of being consistent.

Oh! Right, that conversation.

Well, anyways, if you're using indexed triangles and issuing them in a fairly local order (such as adjacent triangles across a strip), you're essentially going to be getting no benefit from tri-strips.

Why? Because the hardware caches the vertex shader output for a specific vertex index, so that when you request another triangle, you don't have to transform its vertices if their index is already in the cache. The size of the post-transform cache varies with the number of attributes you're passing down, but in general it's big enough that it saves a lot of work. Thus, a tri-strip/fan will only benefit you if your index buffers are eating up a lot of memory/bandwidth. If you're really rendering hundreds of thousands of triangles per frame, you'd get better results dicing up your scene and frustum culling.

So, TL;DR -- don't worry about strips/fans as a first-step optimization, use indexed primitives, and issue your triangles in a local pattern.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Bonus posted:

I'm just using glBegin/glVertex/glEnd for each triangle. Does using a vertex array or VBO really help that much? I have an 8800 GT so the card itself shouldn't be a problem. I'll try using vertex arrays and see how it goes.

Yeah, it helps a lot if you're at all CPU-bound (which you will be if you're rendering 2 million triangles that way).

What you'll want to do is dice it up into 256x256 tiles when you load it (so each tile has indices that will fit in a 16-bit int) and generate an indexed vertex array for each tile. The indexing will let you take advantage of the aforementioned caching, the vertex arrays will let you avoid the driver/CPU overhead of a ton of gl calls, and the VBOs will save you the PCI-E bandwidth of streaming your entire heightmap over the bus every frame. If you want, you can even chunk the tiles down smaller to 64x64, and do frustum culling on them as well.
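
A minimal, hedged sketch of the per-tile draw with legacy vertex arrays (the Tile struct and buffer setup are placeholders); the point is just 16-bit indices and one glDrawElements per tile:

code:
// assumes each Tile owns a VBO of packed vertices and an IBO of GLushort indices
struct Tile
{
    GLuint vbo;
    GLuint ibo;
    GLsizei indexCount;
};

void drawTile(const Tile& tile)
{
    glBindBuffer(GL_ARRAY_BUFFER, tile.vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, tile.ibo);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, 0);            // xyz positions, tightly packed

    // 16-bit indices: fine as long as each tile references <= 65536 vertices
    glDrawElements(GL_TRIANGLES, tile.indexCount, GL_UNSIGNED_SHORT, 0);

    glDisableClientState(GL_VERTEX_ARRAY);
}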

Hubis fucked around with this message at 04:43 on May 6, 2010

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

heeen posted:

Is there a better technique for shadowmaps than 6 passes for the six sides for a depth cubemap? I've heard the term "unrolled cubemap" somewhere, what's that about?


If you have access to it, one way to avoid the multiple passes is to use multiple render targets and the geometry shader to duplicate input geometry and assign it to the appropriate render target.
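
As a rough sketch of what that looks like (assuming a cube-map depth texture attached as a layered FBO attachment, and an invented faceViewProj uniform array), the geometry shader just re-emits each triangle six times, with gl_Layer selecting the face:

code:
#version 150
layout(triangles) in;
layout(triangle_strip, max_vertices = 18) out;   // 6 faces * 3 vertices

uniform mat4 faceViewProj[6];   // assumed: one view-projection matrix per cube face

void main()
{
    for (int face = 0; face < 6; ++face)
    {
        gl_Layer = face;                          // route this copy to cube face 'face'
        for (int i = 0; i < 3; ++i)
        {
            gl_Position = faceViewProj[face] * gl_in[i].gl_Position;
            EmitVertex();
        }
        EndPrimitive();
    }
}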

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

Is there a way to write shader attribs straight from CUDA without having to pass through main memory?

Specifically what I'd like to do is simulate some 'terrain' morphing (in this case shifting water levels) where the morphing computation is done in CUDA, and passes the updated height information right into the vertex pipeline without ever leaving the card.

You can write to any sort of buffer (Vertex/Index/Constant, Texture, DrawIndirect) with CUDA using the D3D/OpenGL interop API. I'm not sure precisely what you are saying you want to do, but you should be able to (see the sketch after this list):

1) Create a texture/constant buffer in your graphics API
2) Register it with CUDA
3) Map the resource(s), getting back a void pointer corresponding to the device memory address of the buffer
4) Run the CUDA kernel using that pointer
5) Un-map the resource (releasing it to the graphics API)
6) Bind the resource and use it in a shader
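
Here's a minimal, hedged sketch of steps 2-5 using the CUDA runtime's graphics interop calls; heightVbo, the kernel, and the launch dimensions are placeholders:

code:
#include <cuda_gl_interop.h>

// 2) register the GL buffer with CUDA once, after creating heightVbo on the GL side
cudaGraphicsResource* resource = 0;
cudaGraphicsGLRegisterBuffer(&resource, heightVbo, cudaGraphicsRegisterFlagsWriteDiscard);

// then, per frame:
// 3) map it and get a device pointer
float* devPtr = 0;
size_t numBytes = 0;
cudaGraphicsMapResources(1, &resource, 0);
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &numBytes, resource);

// 4) run the kernel that writes the new heights straight into device memory
updateHeights<<<gridSize, blockSize>>>(devPtr, fieldWidth, fieldHeight);

// 5) release it back to the graphics API, then 6) draw from heightVbo as usual
cudaGraphicsUnmapResources(1, &resource, 0);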

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

I mostly want to avoid having to pass the heightfield over the bus since the only thing that even cares about it is the GPU. I assume this approach is smart enough to realize that the memory address lives on the card and doesn't need to transfer it around? And it can handle a somewhat large (2000 on a side square) dataset?

Yeah, this is exactly what you'd want -- this lets you generate the data on the GPU via CUDA directly into device memory, and then simply re-map the device buffer as a GL texture without transferring from device to host. There's some driver overhead in this mapping so you want to make the map call as rarely as possible (interestingly, the overhead is per-call, not per-resource, so map as many resources as you can in each call); however, it's going to be a whole lot better than the Device->Host->Device memcpy latency you'd otherwise have to deal with. The only limit on size is (a) your available texture memory, and (b) the texture/buffer size limitations of your graphics API. Another thing to consider is that you should make use of both the Base and Pitch properties when writing to the memory in CUDA, as the graphics API may have specific requirements for how the elements are laid out based on the element format (RGBA8 vs RGB8 vs R32f, etc)
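
For the pitch point, a hedged kernel-side sketch of what I mean (assuming the mapped resource reports a base pointer plus a row pitch in bytes):

code:
__global__ void writeHeights(unsigned char* base, size_t pitchInBytes, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // step rows by the API-reported pitch, not by width * sizeof(float)
    float* row = (float*)(base + y * pitchInBytes);
    row[x] = 0.0f;   // placeholder for your simulated height value
}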

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

This isn't really true. While min/max are usually built-in, what you just typed out is exactly the kind of instruction emitted in shader models that don't support branching.

There are instructions that take two values, compare them, and if the comparison checks out, then they assign a value to a register. These are fairly cheap, what isn't cheap though is being forced to evaluate every execution path and then use a conditional to choose the result, or split an operation into multiple draw calls so you can switch to a different shader permutation.

That's what branching was put in to improve, and it's not TOO expensive provided you follow a simple rule: Only use conditions that are likely to produce the same result on nearby pixels. GPUs are optimized for processing blocks of pixels the same way, which makes branch mispredictions VERY expensive.

This is basically it. Branching comes in two forms: Conditional instructions (such as "if (a>b) {a=b}") which cost nothing extra, and divergent code paths (such as "if (a>b) { a = tex2D(texFoo); } else {b = tex2D(texFoo);}") which may potentially have the cost of executing both code paths. In the first case, you are just running the same instructions, and each thread may get different results based on the values in their registers; in that sense, "a = min(a, b)" is no different than "a = a+b". In the second case, you can think of the GPU as processing fragments/threads in "clusters" which all execute together, with a mask for each thread in the cluster saying whether to use or discard the results for a given instruction. When all the threads in a cluster go down the same path (such as all the fragments generated by a small triangle) the GPU is smart enough to detect this, and only execute instructions in that path. If the cluster is split (i.e. it is "divergent") then you have to issue instructions for both paths, even though only a subset of the threads will actually use the results from each.

So, if you've got high-level branching, such as changing material type based on constant buffer values or low-frequency changes, you won't really see any penalty; if you've got very locally divergent execution patterns, then you'll see worst-case performance.
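
A hedged fragment-shader illustration of the two cases (the samplers and varyings are invented):

code:
uniform sampler2D texFoo;
uniform sampler2D texBar;
varying vec2 vUV;
varying float vMask;

void main(void)
{
    float a = texture2D(texFoo, vUV).r;
    float b = texture2D(texBar, vUV).r;

    // Case 1: conditional select -- every thread runs the same instructions,
    // so this costs essentially nothing extra (same idea as min(a, b)).
    float c = (a > b) ? b : a;

    // Case 2: divergent paths -- if threads in the same cluster disagree on
    // vMask, the hardware ends up issuing both branches.
    vec3 color;
    if (vMask > 0.5)
        color = texture2D(texFoo, vUV * 2.0).rgb;
    else
        color = texture2D(texBar, vUV * 2.0).rgb;

    gl_FragColor = vec4(color * c, 1.0);
}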

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

YeOldeButchere posted:

A little while ago I asked about the performance of branching in shaders, thanks for the answers by the way, but I'd like to know where to find that sort of info myself. I'm guessing this is GPU dependent and outside the scope of specific APIs which would explain why I don't recall anything about this mentioned in the D3D10 documentation. A quick google search doesn't seem to return anything useful either. I've found some info on NVIDIA's website but it's mostly stuff about how some GPU support dynamic branching and very little about performance. Ideally I'd like something that goes in some amount of detail about modern GPU architectures so I can really know why and when it's not a good idea to use branching in shaders, preferably with actual performance data shown.

I guess the real answer here would be to write some code and test the drat thing myself, but I'm wary of generalizing whatever results I'd get without better understanding of the underlying hardware.

It's going to really vary, not just from vendor to vendor but from chip to chip. If you want to test it yourself, you probably want to test (a) static branches, (b) branches off of constant buffer values, and (c) dynamic branches (either on interpolants or on texture lookups) at varying frequencies. It's really hard for you to test, though, because it's not obvious how the compilers will interpret the code branch into assembly (as a conditional move versus an actual branch).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

passionate dongs posted:

Does anyone know what is going on here?

I'm trying to make some tubes. For some reason when I use shade with GL_FLAT the lighting looks right (although not smooth of course). If I turn on GL_SMOOTH, for some reason the shading isn't continuous. Where should I look to fix this? normals? geometry?

Right now each segment is an individual vertex array that is drawn in sequence.

GL_FLAT shades using only one (provoking) vertex per primitive, so is it possible that you've got bad normals in one of the other two? The errors look pretty systematic...

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

How well does OpenCL handle loops that don't necessarily run for the same number of repetitions? Would the other threads in the warp just stall while waiting for the longer one to finish?

Specifically I'm thinking of writing a voxel renderer as an experiment, and I figured I'd use octrees for raycasting, so depending on how many nodes would have to be searched in each pixel, a bunch of adjacent pixels might have wildly different search depths. Just wondering if this is a huge problem for OpenCL or if it would still be better than trying to do it all on the CPU.

yes, that's exactly what would happen (you'd suffer a hardware utilization penalty anywhere you have divergence within a warp). However, in the case of a ray-caster, rays within the same warp should ideally be fairly spatially coherent, so it's likely that adjacent elements will be following the same code path, and thus not be too divergent. You will still see this problem around shape edges, for example, but it will depend somewhat on your scene complexity as to how much that actually matters.

You could try assigning your threads to pixels using a Hilbert Curve instead of a naive 2D mapping, which should improve your spatial coherence somewhat.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Optimus Prime Ribs posted:

Is there ever a justifiable reason to use display lists in OpenGL?
I've read online that they can sometimes be a viable option over VBOs or shaders, but I don't know how accurate or dated that is.

Right now I have VBOs and shaders implemented, and I'm just wondering if I should even bother with display lists.

The ARB is trying its darndest to make them disappear, despite their potential convenience in reducing API overhead (mostly because, as they currently exist, they put a lot of constraints on future extensions).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

So, one thing I haven't been keeping up on: How important is throughput now, if at all? That is, 2003-ish, static buffers were becoming hot poo poo to reduce data throughput to the card compared to the previously-favored approach of essentially bulk-copying data from system memory to the GPU every frame.

However, it's always had drawbacks in ease of use and draw call reduction: Instancing can be used to handle duplicated models, but it can't handle models at different LOD. Accessing the data for CPU-side tasks (i.e. decals) requires storing it twice. Some things are practical to render from static data in some scenarios, but not others (i.e. skeletal animation can be easily done with vertex shaders, but large numbers of morph targets run out of attribute streams on older hardware). Some things could in theory be done statically, but would require stupid amounts of storage (i.e. undergrowth).

Is the performance hit of using only dynamic buffers and just bulk-copying data even noticeable any more, or is the bottleneck pretty much all draw calls and shaders now?

Bandwidth matters a little, but usually with constantly updating dynamic buffers API overhead (in DirectX) can matter more if you're not strongly GPU-bound. One good trick is to try and issue all your buffer updates (and other state for that matter) in a single large block -- this lets the driver's internal tracking/ref-count data stay in caches, and you'd be surprised how much it can improve performance if you are API-bound. Also, make sure you are mapping your buffers as "DISCARD/Write-only" so the API knows it can stage the write and doesn't have to wait for any calls in the pipe accessing that buffer resource.

In general though your instinct is correct. Static buffers are definitely still preferred, but you're not going to notice the downstream PCI-E bandwidth as much.
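
On the DISCARD point, the GL equivalent of a write-only/discard lock looks roughly like this hedged sketch (buffer names are placeholders; in D3D it's the D3D10_MAP_WRITE_DISCARD / D3DLOCK_DISCARD flag):

code:
// Orphan the old contents and get a write-only pointer the driver can stage, so the
// CPU never waits on in-flight draws that are still reading the previous data.
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(dst, newVertexData, bufferSize);   // fill the fresh storage
glUnmapBuffer(GL_ARRAY_BUFFER);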

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Making card-specific behavior works great when your audience uses one card. That isn't the PC market. Making card-specific behavior there means you're sinking development resources into a fraction of your audience.

Or, if you're Epic and are willing to sink a lot of effort into writing low-level engine code so that you can sell the engine at a markup to a gaggle of independent developers.

But no, in general it's not a good idea because GPUs change hardware way faster than CPUs do right now, and the major vendors are very different under the hood in a lot of important ways, so it would be hard to come together with a fully shared ISA.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Epic would rather spend their time writing code with tangible benefits to as many users as possible too. The fact that it affects more users does not change the fact that, relative to other features they could be implementing, it isn't a very productive use of time.

My point was that the people you hear advocating it are all developers who have middleware to sell -- it would be a HUGE investment of time and effort to implement a flexible, multi-platform low-level engine, but could also provide them with performance that would give them a competitive advantage over developers with less resources to implement and maintain such a codebase. Remember back when Mark Rein was out stomping for Larrabee?

But that's neither here nor there, because this will never happen. CUDA/OpenCL with some limited access to fixed-function units is probably as close as you'll ever get.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

HolaMundo posted:

How should I handle having multiple shaders?
For example, I now have a normal mapping shader (which also handles texture mapping), but let's say I want to render something else with a texture but no normal map, how would I do that? Having another shader just for texture mapping seems stupid.
Also thought about having a bool to turn normal mapping on or off but it doesn't seem right either.

The ideal solution is to use the pre-processor to #ifdef out sections of the code corresponding to different features, then pass defines to the compiler as macros and generate all the permutations you might need.

However, it's a lot simpler (and practically as good) to just place the code in branches, and branch based on bool parameters from constant buffers. So long as the branches are based completely on constant buffer values you shouldn't see any problem. This solution is almost as good as using defines on newer hardware; on older hardware (GeForce 7000-era) you might see some slightly slower shader loading/compilation time, but almost certainly not noticeable unless you're doing lots of streaming content.
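
A hedged sketch of the #define route (macro and sampler names invented):

code:
// --- fragment shader source, shared by all permutations ---
uniform sampler2D diffuseMap;
#ifdef USE_NORMAL_MAP
uniform sampler2D normalMap;
#endif
varying vec2 vUV;
varying vec3 vNormal;

void main(void)
{
#ifdef USE_NORMAL_MAP
    vec3 n = normalize(texture2D(normalMap, vUV).xyz * 2.0 - 1.0);  // tangent-space normal
#else
    vec3 n = normalize(vNormal);
#endif
    float light = max(n.z, 0.0);                 // stand-in for the real lighting math
    gl_FragColor = vec4(texture2D(diffuseMap, vUV).rgb * light, 1.0);
}
Since glShaderSource takes an array of strings, you can pass "#define USE_NORMAL_MAP\n" as the first string and the shared source as the second, then link one program per permutation.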

zzz posted:

I haven't touched GPU stuff in a while, but I was under the impression that static branches based on global uniform variables will be optimized away by all modern compilers/drivers and never get executed on the GPU, so it wouldn't make a significant difference either way...?

Best way to find out is benchmark both, I guess :)

Spite posted:

You'd hope so, but I wouldn't assume that! The vendors do perform various crazy optimizations based on the data. I've seen a certain vendor attempt to optimize 0.0 passed in as a uniform by recompiling the shader and making that a constant. Doesn't always work so well when those values are part of an array of bone transforms, heh.

Basically, you don't want to branch if you can avoid it. Fragments are executed in groups, so if you have good locality for your branches (ie, all the fragments in a block take the same branch) you won't hit the nasty miss case.

this should work, so long as the static branch is a bool.

Woz My Neg rear end posted:

It's almost always preferable, compared to a true conditional, to run the extra calculations in all cases and multiply the result by 0 if you don't want it to contribute to the fragment.

This is the opposite of true; do not do this.

Hubis fucked around with this message at 02:31 on Dec 6, 2011

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

It honestly isn't that hard to create a VBO, map it, and write a function that just copies the parameters into it and increments a pointer, and a flush function to call manually to terminate the primitive, or if you try to push too much in to the buffer.

Congratulations, you just wrote a renderer with Begin/End-like behavior, except now you don't have to rewrite everything when you want to do stuff like batching, which in that case would consist of "if the next set of geometry has the same shading properties, then don't flush," and you can easily upgrade it to do stuff like pump multiple quantities of complex vertex data without massive amounts of function call overhead.

"The slow paths don't tell you they're slow" is part of the problem, but the other part is that you'll hit a brick wall on the limitations. If you want a nice example, circa 2002 it was becoming really obvious that extending fixed-function to the kind of effects people wanted it to do was turning into a mess, and the only way to fix it was going to ultimately be scrapping fixed-function the same way they scrapped it for the vertex pipeline when people started wanting to do skinning on the GPU.

The "added simplicity" of legacy paths is just a trap where you'll get used to doing things the lovely way and then have to relearn everything, better to do it right the first time.

Well, the other problem with removing glBegin/glEnd (which I am strongly in favor of) is that you run the risk of having the same problem as DirectX does, where even for simple examples you need like 500+ lines of cruft to handle all the resource creation, etc.

What would be nice is an updated GLUT that creates a VBO in the background and provides an interface to the programmer so they can use something like "glutBegin/glutEnd" for examples, with the clear determination that this is just for illustrative purposes.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

Are const arrays inside shader code supposed to be blatantly slow?

I have a tessellation control shader that needs to index the vertices of the input primitive based on the invocation ID (so gl_InvocationID == 0 means that the TCS operates on the edge of gl_in[0] and gl_in[1], etc).

Initially, my code had an if-statement (which I would assume GPUs don't like that much when it diverges inside the same execution unit) to make this determination. I figured that I could flatten the vertex indices out into a single const int[8] and index them based on the invocation ID (so I could say indices[glInvocationID * 2] and indices[glInvocationID * 2 + 1] and get the stuff that I need).

However, doing this seems to hit me with a 66% performance drop when compared to using if-statements! Would passing the index array as a uniform yield a performance benefit?

What graphics card are you seeing this with? Depending on the hardware, the problem probably isn't the const array, but the fact that you are using dynamic indexing into the array. Some GPUs don't really support base+offset indexing, instead mimicking it using registers. Unfortunately, if you index the arrays dynamically, this requires all the accesses to be expanded (either by unrolling loops, or expanding into a giant nasty set of branches). So you could actually be gaining MORE branches, instead of eliminating them.

Why do you need to index the edges like you are doing? Your best bet would be to structure the input to your shader so that it doesn't need to branch at all, even if that means just adding extra interpolants. I'm not sure if that would work for you here, though.

e: There's no way to see intermediate assembly with GLSL, right? For DirectX, you could use FXC to dump the shader, which might show you if that were happening at the high level (though not if it's being introduced at the machine-code translation stage).

Hubis fucked around with this message at 14:51 on Mar 20, 2012

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

shodanjr_gr posted:

This is on a Quadro 5000.

This is happening inside a tessellation control shader. For each vertex of the input primitive, you get one invocation of the TCS. All invocations of the TCS for each primitive have access to a gl_in[numVertices] array of structs that contains the per-vertex attributes of the primitive as they are passed by the vertex shader. I want each invocation of the TCS to do stuff for each EDGE of the input primitive and thusly i need to index the vertex positions that define the edge from gl_in. Since the per-edge operations are kind of expensive, I can not have each invocation of the TCS do the operations for ALL edges unfortunately (I am taking this approach for other, cheaper per-primitive operations).


I believe there are utilities released by AMD/NVIDIA that will compile GLSL down to the machine-level assembly for you....

E: Nevermind, thanks!

Hubis fucked around with this message at 21:50 on Mar 20, 2012
