Entheogen posted:I have successfully implemented the 3d slicing technique, however it appears to be rather slow. The way I calculate the inverse matrix also doesn't appear to be much of a factor here, as I only do it once, and then send it to the vertex shader, which actually multiplies it with the tex coordinates. I think the limit here is imposed by my card's texture fill rate and the blending that it has to do between all slices. Also there appear to be some artifacts due to slicing. They appear as these lines that criss-cross the volume.

That artifact pattern seems weird; I expect that it has something to do with the depth resolution of your texture and the number of slices you do. Still, if you have trilinear filtering working properly, I don't quite see why you should be seeing those artifacts (although I suppose it really depends on what your data looks like). Either way, it's definitely a sampling error of some sort. I'd agree that you should "compress" multiple slices into one by sampling your depth texture multiple times per shape. Instead of just sampling "TexCoord[0].stp", grab "TexCoord[0].stp + float3(0, 0, stepSize*n)" as well (for n steps). Obviously, you'd want that offset transformed into the proper space for your 3D texture as well.
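In the fragment shader, something along these lines (just a sketch -- stepSize/numSteps are hypothetical uniforms, and the offset is left in slice space rather than being transformed into volume space):

code:
uniform sampler3D volumeTex;
uniform float stepSize;  // hypothetical: spacing between folded-in sub-slices
uniform int numSteps;    // hypothetical: sub-samples per drawn slice

void main()
{
    vec4 accum = vec4(0.0);
    for (int n = 0; n < numSteps; ++n) {
        vec3 coord = gl_TexCoord[0].stp + vec3(0.0, 0.0, stepSize * float(n));
        vec4 s = texture3D(volumeTex, coord);
        // front-to-back compositing of the sub-samples
        accum.rgb += (1.0 - accum.a) * s.a * s.rgb;
        accum.a   += (1.0 - accum.a) * s.a;
    }
    gl_FragColor = accum;
}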
# Jul 31, 2008 14:22
pianoSpleen posted:As for the threading thing, it's not really that surprising. Just because your program is single-threaded doesn't mean the display driver it's calling is. If you do something weird like this, it's not entirely unexpected that the driver will balance the load between cores to increase performance.

In fact, depending on what hardware/driver you are using, I can almost guarantee that this is the case.
# Sep 6, 2008 04:53
heeen posted:Why not make it all triangles and save yourself a state change and an additional vertex and index buffer?

Yes, you're almost certainly better off with all triangles (preferably tri-strips).
# Sep 22, 2008 13:05
shodanjr_gr posted:Can anyone give me some guidelines on geometry shader performance? (a paper/article/tutorial for instance)

The problem you are running into is due to the somewhat short-sighted way the Geometry Shader was fitted into the pipeline on a lot of hardware. Essentially, if you're using the geometry shader to do expansion in the data stream, you run into memory contention problems. This is because the primitives (triangles, points, etc.) need to be rasterized in issue order, which means that the memory buffer coming out of the geometry shader stage has to match the order of primitives coming into it (i.e. primitive A, or any shapes derived from primitive A, must all come before primitive B). If you're doing expansion in the shader, this means that processing of primitive B can't start (or at least, its results can't be output) until primitive A is finished. Combine this with the fact that the clipping/rasterizer stage now has to process 4 times faster than the vertex shading stage, and that the buffers were optimally sized for a 1:1 ratio, and you have the potential for a lot of bottlenecks.

For what it's worth, you *might* want to try a two-stage approach: shade your vertices and use the GS to expand, feed the output to a Stream-Out buffer, then re-render that with a rudimentary vertex shader. Because you're streaming straight to memory, you might not run into the buffer limitation issues. I haven't tested this myself, though, and the extra processing + slower memory speed might negate any gains you get. This, incidentally, is why tessellation in the DX10 Geometry Shader is generally discouraged.

e: To be taken with a grain of salt. Looking at the post again, 75fps to 4fps seems like a very dramatic slowdown for this sort of thing. It could actually be possible that you're falling back to software mode, but that seems unlikely based on your description.

Hubis fucked around with this message at 18:25 on Sep 26, 2008
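In OpenGL terms, the two-pass version would look roughly like this (a sketch using GL 3.0-style transform feedback; the program and buffer names are hypothetical, and you'd query GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN to get the expanded count):

code:
// Pass 1: VS + expanding GS, rasterizer off, output captured to a buffer
glUseProgram(expandProgram);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, streamOutVbo);
glEnable(GL_RASTERIZER_DISCARD);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, gridVertexCount);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);

// Pass 2: re-render the captured points with a trivial pass-through VS
glUseProgram(passThroughProgram);
glBindBuffer(GL_ARRAY_BUFFER, streamOutVbo);
// ... glVertexAttribPointer setup for the captured layout ...
glDrawArrays(GL_POINTS, 0, expandedPointCount);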
# Sep 26, 2008 18:09
shodanjr_gr posted:Thanks for the input!

Unfortunately, no. Adding the ability to disable/relax that requirement has been suggested as a workaround, but as it stands it's an ironclad rule of the API.

Remember, you're working with a pipeline, not a series of iterative stages being run on the whole data set at a time. Thus, if data transfer and vertex processing are not the bottlenecks in your pipeline, optimizing them isn't going to give you any performance benefit. Unless you only have one job going through the entire pipeline at a time, your render time is going to be only as fast as the single slowest stage of the pipeline (barring some strange edge cases). If you were using DirectX, I'd say run it through NVPerfHUD and see where your bottlenecks are, then optimize from there. From what you describe, it sounds like Blending/Raster Ops are your bottleneck, not Data Transfer/Input Assembly/Vertex Processing.

shodanjr_gr posted:To give you a measure, let's assume that I send over a 200 * 200 grid display list to the card and I generate 1 point in my GS. This runs at < 75 fps and all is well. If I keep the display list constant and generate 4 points in my GS, I drop to 25 FPS! If I move up to 9 points, I drop to 12 FPS. 16 points brings me into single digits. At 16 points per grid point, I get a total of 640,000 points. Now, if I send an 800 * 800 display list for rendering and only generate a single point (for the same number of total points), I get at least 15 FPS. So for the same amount of geometry, using the GS to expand gives me a 75% reduction in frame rate compared to a display list...

Ah, OK. Then yes, this sounds like you're running into the issue I described above.

Hubis fucked around with this message at 20:41 on Sep 26, 2008
# Sep 26, 2008 20:39
shodanjr_gr posted:I'll look into your suggestion to see if I can get a buff in FPSes... I really don't want to can this idea, since it's been producing very nice results visually.

shodanjr_gr posted:edit:

I can't speak to AMD internals, but it's my understanding that they run into the same problem. The only difference would be the size of the post-GS buffer, which will determine where you hit those falloffs and how much they'll hurt performance.
# Sep 26, 2008 21:01
shodanjr_gr posted:Thanks for the replies Hubis. I'll check out the spec you linked.

Not that I'm aware of.
# Sep 26, 2008 21:07
HauntedRobot posted:Kind of a vague question, sorry.

You're going to run into significant z-buffer precision problems leading to Z-fighting for things that are near one another (within one exponential precision step of one another) unless they are drawn in strict back-to-front order, including the triangle order within a single draw call.
# Oct 1, 2008 21:30
Mithaldu posted:Due to finally being able to compile OpenGL 0.57 in CPAN, I now have access to a workable example of vertex arrays in Perl. If I read this correctly, it means that it can draw a single triangle in exactly one call, as opposed to the 6 needed in immediate mode. This would mean a massive speed-up, since Perl has a horrible penalty on sub calls.

There are no real concerns with combining them, as vertex arrays are very commonly used and self-contained. Just don't go overwriting buffers the driver is still using and you should be OK. Also, I'm still confused as to why you're issuing only one triangle at a time -- are you really completely changing state every draw?
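For reference, the C-level pattern being wrapped is just this (a minimal sketch): with vertex arrays, the per-vertex calls collapse into a single draw call, which is exactly what helps with Perl's sub-call overhead.

code:
GLfloat verts[] = {
    0.0f, 0.0f, 0.0f,
    1.0f, 0.0f, 0.0f,
    0.0f, 1.0f, 0.0f,
};

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, verts);
glDrawArrays(GL_TRIANGLES, 0, 3);  // batch many triangles here, not one
glDisableClientState(GL_VERTEX_ARRAY);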
# Oct 19, 2008 16:07
Mithaldu posted:Got a question about z-fighting issues. I'm rendering layers of roughly 80*80 objects, with roughly 20-40 layers stacked atop each other at any time. There is no computationally cheap way to cut out invisible geometry that I am aware of, so I pretty much have to render everything at all times. This leads to the z-buffer getting filled up pretty fast, resulting in rather ugly occasions of polys clipping through other polys.

You could try disabling Z culling altogether, since it seems very easy for you to render in nearest-to-furthest order. To save shading work, you can use the stencil functionality to write the current layer number to the stencil buffer, and only output a given pixel if the current layer has nothing written to it already (i.e. == 0). Rendering top to bottom in this manner and using the stencil buffer to avoid overdraw should work.

That being said, I'm kind of surprised you're having trouble with Z precision like that. Your z-clear trick would probably work, though I'd recommend rendering from the top down and using the stencil technique mentioned above as well if you go that route.
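A rough sketch of that stencil setup (numLayers and drawLayer are hypothetical; layer 0 is the topmost):

code:
glClearStencil(0);
glClear(GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glDisable(GL_DEPTH_TEST);   // drawing in nearest-to-furthest order instead
glEnable(GL_STENCIL_TEST);

for (int layer = 0; layer < numLayers; ++layer) {
    glStencilFunc(GL_EQUAL, 0, 0xFF);        // pass only unclaimed pixels
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);  // claim pixels we shade
    drawLayer(layer);
}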
# Oct 25, 2008 19:04
Mithaldu posted:Disabling the depth buffer is not an option, as the z-culling needs to be kept functional for each z-layer. Sounds like a pretty neat approach though. (Although I'm confused by what you mean by shading work.)

It's a Dwarf Fortress clone, by chance?

By shading cost, I meant the cost of doing all the texture lookups, combinations, and copying/blending to the framebuffer. If you're rendering 20 overlapping objects, your GPU is basically doing 20x more work than it has to ("overshading"). Of course, I just remembered that you're doing this in Perl, so it's almost certainly CPU-limited.

OK, so you've got this world-space box enveloping the entire scene, and a camera position (which is either external or possibly internal, if you're zoomed in). You've also got Z-Near and Z-Far, which you say you're clamping to max(1.0f, <nearest point in the scene>) and <furthest point in the scene>, respectively. Do I understand correctly? That's what I'd advise.

What I find strange is that the Z-sorting issues get worse the closer you get to the objects. Usually, problems arise as you get further from the camera, when the distance between overlapping objects becomes smaller than the precision at that distance (which, in turn, gets better closer to the camera). It almost seems like the scene is doing the opposite of what it should be doing...
# Oct 25, 2008 20:36
Mithaldu posted:It's not a clone, it's this: http://code.google.com/p/dwarvis/ If you have the game and a decently complex map you can try it out for yourself with the version that's for download there.

Very interesting! I'll have to give it a whirl.

From what I've been seeing, you're taking a very naive (though valid) approach to rendering -- I presume that each tile is a separate draw call, along with each creature, object, etc.? That's going to introduce a lot of overhead in the driver command stream alone, never mind the function call overhead of Perl. The way many games address these sorts of issues is by 'batching' draw calls together wherever possible. Once you're at the point where you've got enough functionality to worry about performance, I can give some more advice if you'd like.

quote:Regarding your understanding of what i'm doing: Exactly, thanks for summarizing it so well.

Nonsense. The Z-near and Z-far values place the near and far planes of the view frustum, and in turn determine components of your projection transform matrix. This matrix transforms your vertices from "view coordinates" (which the Modelview matrix produces) to "clip coordinates", which after the perspective divide are normalized to the same range on all axes ([-1, 1] in OpenGL). The X- and Y-coordinates represent the screen position from (left, bottom) to (right, top), while the Z-coordinate represents the vertex's position between the near and far clip planes (later remapped to a [0, 1] depth value). You can think of the View-to-Clip transform as warping and re-scaling the pyramidal frustum into a cube. These "clip" coordinates are what the hardware uses to clip against the edges of the screen, and what is fed to the Z-test unit as depth. Thus, [1, 150] and [1001, 1150] should look the same to the hardware, since both ranges get normalized after the vertex stage.

Furthermore, if we're actually talking about being "outside" the scene versus being "inside" the scene, then the numbers should be more like outside=[1000, 1150] and inside=[1, 75], because your camera will by definition be closer to the far plane than the boundary of the scene is. This means that not only will the offending objects (like the blue item in the near field) be moving from the low-precision far end of the depth buffer to the higher-precision near end, but there will be more precision per unit in general!

Of course, it's been a while since I worked with OpenGL, so I might be mis-thinking something. The OpenGL FAQ on Depth has some interesting things to say on the subject. In particular, it suggests something along the lines of your "clear the depth between layers" approach for very large depth ranges; however, I still think something else is going awry in your code, because again, this looks like the opposite of the problem you should be having. Can you try just clamping your planes to something like [1, 1000] and moving around the world to see if you're still having issues?

Hubis fucked around with this message at 22:36 on Oct 25, 2008
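To make the precision argument concrete, here's roughly where an eye-space distance z in front of the camera lands in a standard [0, 1] depth buffer (just the textbook perspective mapping, not anything from dwarvis):

code:
float depthBufferValue(float z, float zNear, float zFar)
{
    // clip-space z/w in [-1, 1] for a gluPerspective-style projection:
    float ndc = (zFar + zNear) / (zFar - zNear)
              - (2.0f * zFar * zNear) / ((zFar - zNear) * z);
    return 0.5f * ndc + 0.5f;  // window-space depth in [0, 1]
}
The mapping is hyperbolic: with [zNear, zFar] = [1, 1150], about half of all depth values are spent on z < 2, which is why pushing zNear out helps far more than pulling zFar in.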
# Oct 25, 2008 22:33
Jo posted:You can turn off all the lights in a room, glow in the dark paint still lights up. (bad analogy; that's emissive material)

Since indirect illumination is really complicated to simulate properly (as opposed to direct illumination, which has a simple analytical solution once you determine visibility), the 'ambient' component is basically a hack used to approximate secondary and higher-order light bounces off of other surfaces in the scene (as opposed to light coming directly from the light source). In most cases, ambient light should be treated exactly like direct illumination (using the same materials, etc.) except that it is completely diffuse and directionless -- i.e. instead of N.L*cLight*cDiffuse + (R.V)^s*cLight*cSpecular it would just be cLight*cAmbient. Glow-in-the-dark paint is emissive, which means light that is emitted directly from the surface; it can be whatever color you want.
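As a GLSL sketch of that breakdown (vectors assumed normalized; the material/light names are illustrative):

code:
vec3 shade(vec3 N, vec3 L, vec3 V,
           vec3 cLight, vec3 cAmbientLight,
           vec3 cDiffuse, vec3 cSpecular, vec3 cAmbient, vec3 cEmissive,
           float shininess)
{
    vec3 R = reflect(-L, N);
    vec3 direct = max(dot(N, L), 0.0) * cLight * cDiffuse
                + pow(max(dot(R, V), 0.0), shininess) * cLight * cSpecular;
    vec3 ambient = cAmbientLight * cAmbient;  // no N.L: diffuse, directionless
    return cEmissive + ambient + direct;      // emissive = glow-in-the-dark
}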
# May 7, 2009 15:26
shodanjr_gr posted:Well, I won't lie to you, I was thinking about it, but I seriously doubt that the iPhone has enough oomph to do even rudimentary real-time volume rendering. The CPU is clocked at 400MHz and the graphics chip is not programmable (so stuff like ray marching is out of the question). So I'm developing it using ObjC/Cocoa/GLUT on my white MacBook with a 9400M.

I have high confidence that you could get this working on an NVIDIA Ion system, which is pretty close to that. What I'd really be interested in is whether you could get it working on a Tegra, but I don't know how you'd get a test platform for that.
# May 14, 2009 01:44
Jo posted:I'm just surprised. I always shrugged GLSL off as nothing more than a way of applying fancy textures. That's what 'shader' meant to me. To see that it does lighting and geometric transforms is a very strange and eye-opening realization.

'Vertex program' and 'fragment program' are generally more accurate terms than 'shader', especially nowadays. Even in the 'fixed function' pipeline, very little is actually fixed-function hardware anymore.
# Jun 8, 2009 11:16
Stanlo posted:Indices reduce the size, both in the file and on the GPU.

Space (or more correctly, bandwidth) is a factor, in that indices allow adjacent triangles using the same vertices to load the vertex data only once (via the pre-transform cache); however, a bigger benefit of indices is that they allow the GPU to cache the results of the transform/vertex shader stage in a post-transform cache as well, saving the cost of the vertex processing itself. So yeah, it doesn't matter if you're not GPU performance-bound, but it's pretty important if you are.
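Concretely (a minimal sketch, assuming the vertex arrays are already set up): two triangles sharing an edge need only four unique vertices, and the two shared ones are transformed once, not twice:

code:
GLushort indices[] = {
    0, 1, 2,   // first triangle
    2, 1, 3,   // second triangle reuses vertices 1 and 2 from the cache
};
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, indices);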
# Nov 13, 2009 02:17
not a dinosaur posted:Can I seriously not tile textures from an atlas in OpenGL

Not without doing some pixel shader tricks, no.
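The usual trick looks something like this (a sketch; cellOrigin/cellSize are hypothetical uniforms describing the atlas sub-rectangle in [0, 1] UV space):

code:
uniform sampler2D atlas;
uniform vec2 cellOrigin;
uniform vec2 cellSize;

void main()
{
    // wrap the coordinate yourself, then map it into the atlas cell
    vec2 uv = cellOrigin + fract(gl_TexCoord[0].st) * cellSize;
    gl_FragColor = texture2D(atlas, uv);
}
Note that fract() plays poorly with mipmapping at the wrap seam, so you may need to clamp the LOD or pad the atlas cells.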
# Jan 14, 2010 21:50
not a dinosaur posted:I've never worked with GLSL before but I spent some time reading about it tonight and it sounds pretty interesting. I'm working on a little 2D game that currently uses a bunch of textured quads and renders them using VBOs. The quads are never rotated and never change size, but the texture coordinates change fairly often for animated sprites and that kind of thing.

Yep. In fact, you should have a bunch of TEXCOORDn attributes (for n = [0, 7), I think) which are commonly used to pack those things in. The performance increase you'd see depends almost entirely on where your current bottlenecks are; however, I'd be willing to say that you're likely CPU/driver-interface bottlenecked, in which case making fewer API calls and offloading more work to the shaders would almost certainly be a win.
# Jan 26, 2010 18:44
Contero posted:I'm trying to get rid of this annoying diamond artifact:

Can you post a wireframe screenshot? I have a suspicion about what's going on.

The most common way of converting a height map to geometry is to make squares out of each four adjacent vertices, then bisect the squares to form two triangles. However, if you do this, sometimes the bisection direction will run counter to the actual contour of the surface. What you really want to do is calculate the normals of the two triangles in each bisection direction (top-left to bottom-right, and bottom-left to top-right) and take the dot product between them. Then use the bisection where the dot product is smallest (i.e. where the normals are most different). This ensures that your geometry matches the contours of the underlying terrain as closely as possible.
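A sketch of that per-quad test (the corner naming is mine; the windings only need to be consistent for the comparison to work):

code:
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 u, Vec3 v)   { return { u.x - v.x, u.y - v.y, u.z - v.z }; }
float dot(Vec3 u, Vec3 v)  { return u.x*v.x + u.y*v.y + u.z*v.z; }
Vec3 cross(Vec3 u, Vec3 v) {
    return { u.y*v.z - u.z*v.y, u.z*v.x - u.x*v.z, u.x*v.y - u.y*v.x };
}
Vec3 normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

// Corners: a = top-left, b = top-right, c = bottom-left, d = bottom-right.
// Returns true to split along a-d, false to split along c-b.
bool splitTopLeftToBottomRight(Vec3 a, Vec3 b, Vec3 c, Vec3 d)
{
    // normals of the two triangles for each candidate diagonal
    Vec3 ad1 = normalize(cross(sub(d, a), sub(c, a)));
    Vec3 ad2 = normalize(cross(sub(b, a), sub(d, a)));
    Vec3 cb1 = normalize(cross(sub(b, a), sub(c, a)));
    Vec3 cb2 = normalize(cross(sub(d, b), sub(c, b)));
    // pick the diagonal whose triangle normals disagree the most
    return dot(ad1, ad2) <= dot(cb1, cb2);
}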
# Mar 4, 2010 05:52
Contero posted:Unfortunately this is from Nvidia's ocean FFT example, and the map is going to change every frame. Changing up my vertex ordering every frame might be a little complicated / slow.

Ohhh... for some reason I thought this was a static heightmap. Hmm. Could you just use normals from the normal map instead of interpolated triangle normals? That would decouple your lighting model from the underlying geometric subdivision.

OneEightHundred posted:No, order by (X+Y), meaning it'll alternate on both axes.

You'll still see the artifacts (the underlying problem of the quads being non-planar will remain), but it will make them less visible.
# Mar 6, 2010 02:42
OneEightHundred posted:Generally speaking, shader changes are more expensive but nearly everything else causes a full re-upload of the state on modern hardware.

Actually, very few things cause stalls/full state re-uploads nowadays. State management used to be a bigger issue on older hardware, but there are now enough transistors around that the hardware can version state changes and "bubble" them through the pipeline along with the rendering workload. Even changing render targets doesn't matter as much anymore, unless you're trying to re-bind the target as a texture for reading somewhere else in the pipeline (in which case the hardware will notice that it's being used in two different ways and stall until the changes are committed).

However, state changes aren't necessarily free. Driver overhead is usually the big culprit -- the DirectX stack is pretty deep, and there are a lot of opportunities for cache misses when you're binding resource views in various parts of memory that need to be made resident, have their ref-counts checked, etc. This turns a semantically simple operation, like pushing some textures into the command buffer, into hundreds of idle cycles. This is actually the motivation behind the "Bindless Graphics" extensions in OpenGL: instead of having active state, you put your texture headers/transform matrices/whatever into a big buffer in memory, give the base pointers to the GPU, and have the shaders fetch what they need from where they need it on the fly.

Another thing to watch for is synchronization points. Any time you create resources (and sometimes even when you release them), the CPU and GPU have to sync, idling whichever one was ahead. If you're not careful with how you map your constant buffers (for example) or dynamic vertex data, you can inject a lot of unnecessary overhead.

Obviously this will vary a little with the under-the-hood hardware and driver implementations of various vendors (at GDC, ATI seemed to indicate that changing tessellation state in DX11 might be slightly expensive for them, while NVIDIA doesn't seem to have any penalty, for example), but this should all be accurate "to first order". Of course, if you're targeting hardware from pre-2008 or so, it might not have such robust state versioning, so this will still matter. And if you're using D3D9 (OneEightHundred is definitely right about the improvements in D3D10/D3D11), you definitely still need to be careful. In modern APIs/hardware, though, it's becoming less of a concern. Pay attention if you're mixing Compute (DirectCompute/OpenCL/CUDA) and Graphics (Direct3D or OpenGL), though.

Hubis fucked around with this message at 03:47 on Apr 29, 2010
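On the constant-buffer point, the usual way to avoid those sync points in D3D11 looks like this (a sketch; cbData is a hypothetical struct mirroring the buffer layout):

code:
D3D11_MAPPED_SUBRESOURCE mapped;
// DISCARD tells the driver to hand back fresh storage instead of stalling
// until the GPU has finished with the old contents
if (SUCCEEDED(context->Map(constantBuffer, 0,
                           D3D11_MAP_WRITE_DISCARD, 0, &mapped))) {
    memcpy(mapped.pData, &cbData, sizeof(cbData));
    context->Unmap(constantBuffer, 0);
}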
# Apr 29, 2010 03:44
Yeah, I *believe* so. The hardware numbers all get messy around G80. Definitely on Ion2.
# May 2, 2010 17:50
UraniumAnchor posted:I'm trying to work out how to draw terrain as a crossover pattern of triangles, like so:

Is there any reason you have to use that pattern, as opposed to something like this, with every quad split along the same diagonal?

code:
+--+--+--+
|\ |\ |\ |
| \| \| \|
+--+--+--+
|\ |\ |\ |
| \| \| \|
+--+--+--+
# May 4, 2010 23:14
OneEightHundred posted:Using anything other than crossover gives you "diamond" or "stripe" artifacting caused by diagonals running parallel to their neighbors and verts having mixed 45/90 degree angles with other verts instead of being consistent.

Oh! Right, that conversation.

Anyway, if you're using indexed triangles and issuing them in a fairly local order (such as adjacent triangles across a strip), you're going to get essentially no extra benefit from tri-strips. Why? Because the hardware caches the vertex shader output for each vertex index, so that when you issue another triangle, its vertices don't have to be re-transformed if their indices are already in the cache. The size of the post-transform cache varies with the number of attributes you're passing down, but in general it's big enough to save a lot of work. Thus, a tri-strip/fan will only benefit you if your index buffers are eating up a lot of memory/bandwidth. If you're really rendering hundreds of thousands of triangles per frame, you'd get better results dicing up your scene and frustum culling.

So TL;DR -- don't worry about strips/fans as a first-step optimization; use indexed primitives, and issue your triangles in a local pattern.
# May 5, 2010 20:07
Bonus posted:I'm just using glBegin/glVertex/glEnd for each triangle. Does using a vertex array or VBO really help that much? I have an 8800 GT so the card itself shouldn't be a problem. I'll try using vertex arrays and see how it goes.

Yeah, it helps a lot if you're at all CPU-bound (which you will be if you're rendering 2 million triangles that way). What you'll want to do is dice the heightmap into 256x256 tiles when you load it (so each tile's indices fit in a 16-bit int) and generate an indexed vertex array for each tile. The indexing lets you take advantage of the aforementioned caching, the vertex arrays let you avoid the driver/CPU overhead of a ton of GL calls, and the VBOs save you the PCI-E bandwidth of streaming your entire heightmap over the bus every frame. If you want, you can chunk the tiles down smaller, to 64x64, and do frustum culling on them as well.

Hubis fucked around with this message at 04:43 on May 6, 2010
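Index generation for one tile is just a double loop (a sketch; w*h must be <= 65536 for 16-bit indices to work):

code:
#include <cstdint>
#include <vector>

std::vector<uint16_t> buildTileIndices(int w, int h)  // w x h vertices
{
    std::vector<uint16_t> idx;
    idx.reserve((w - 1) * (h - 1) * 6);
    for (int y = 0; y < h - 1; ++y) {
        for (int x = 0; x < w - 1; ++x) {
            uint32_t i = uint32_t(y) * w + x;
            // two triangles per heightmap quad, in a cache-friendly order
            uint16_t quad[6] = {
                uint16_t(i),     uint16_t(i + 1), uint16_t(i + w),
                uint16_t(i + w), uint16_t(i + 1), uint16_t(i + w + 1),
            };
            idx.insert(idx.end(), quad, quad + 6);
        }
    }
    return idx;
}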
# May 6, 2010 04:37
heeen posted:Is there a better technique for shadowmaps than 6 passes for the six sides of a depth cubemap? I've heard the term "unrolled cubemap" somewhere, what's that about?

If you have access to it, one way to avoid the multiple passes is to render to all six faces at once, using the geometry shader to duplicate the input geometry and route each copy to the appropriate face of the render target.
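In GL terms, that's a layered cubemap FBO plus gl_Layer in the geometry shader -- roughly like this (a sketch; faceViewProj is a hypothetical per-face view-projection array, and the vertex shader is assumed to pass world-space positions through gl_Position):

code:
#version 150
layout(triangles) in;
layout(triangle_strip, max_vertices = 18) out;

uniform mat4 faceViewProj[6];

void main()
{
    for (int face = 0; face < 6; ++face) {
        gl_Layer = face;  // routes the primitive to that cubemap face
        for (int v = 0; v < 3; ++v) {
            gl_Position = faceViewProj[face] * gl_in[v].gl_Position;
            EmitVertex();
        }
        EndPrimitive();
    }
}
Whether this beats six passes depends on the hardware, for the same GS-expansion reasons discussed earlier in the thread.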
# May 6, 2010 21:17
UraniumAnchor posted:Is there a way to write shader attribs straight from CUDA without having to pass through main memory?

You can write to any sort of buffer (vertex/index/constant, texture, DrawIndirect) from CUDA using the D3D/OpenGL interop API. I'm not sure precisely what you're trying to do, but you should be able to:

1) Create a texture/constant buffer in your graphics API
2) Register it with CUDA
3) Map the resource(s), getting back a void pointer corresponding to the device memory address of the buffer
4) Run the CUDA kernel using that pointer
5) Unmap the resource (releasing it back to the graphics API)
6) Bind the resource and use it in a shader

A sketch of those steps against a GL buffer is below.
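Something like this with the CUDA runtime API (a sketch; vbo, myKernel, grid, and block are hypothetical, and error checking is omitted):

code:
#include <cuda_gl_interop.h>

cudaGraphicsResource* cudaVbo = 0;

// 2) register the GL buffer with CUDA (once, at init time)
cudaGraphicsGLRegisterBuffer(&cudaVbo, vbo, cudaGraphicsRegisterFlagsWriteDiscard);

// 3) map it and get a device pointer
float* devPtr = 0;
size_t numBytes = 0;
cudaGraphicsMapResources(1, &cudaVbo, 0);
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &numBytes, cudaVbo);

// 4) run the kernel on it
myKernel<<<grid, block>>>(devPtr);

// 5) unmap, handing the buffer back to OpenGL
cudaGraphicsUnmapResources(1, &cudaVbo, 0);

// 6) bind vbo and render with it as usual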
# May 8, 2010 04:11
UraniumAnchor posted:I mostly want to avoid having to pass the heightfield over the bus since the only thing that even cares about it is the GPU. I assume this approach is smart enough to realize that the memory address lives on the card and doesn't need to transfer it around? And it can handle a somewhat large (2000 on a side square) dataset?

Yeah, this is exactly what you'd want -- it lets you generate the data on the GPU via CUDA directly into device memory, and then simply re-map the device buffer as a GL texture without ever copying from device to host. There's some driver overhead in the mapping, so you want to make the map call as rarely as possible (interestingly, the overhead is per-call, not per-resource, so map as many resources as you can in each call); however, it's going to be a whole lot better than the Device->Host->Device memcpy latency you'd otherwise eat. The only limits on size are (a) your available texture memory and (b) the texture/buffer size limitations of your graphics API.

Another thing to keep in mind is that you should honor both the base address and the row pitch when writing to the memory from CUDA, as the graphics API may have specific layout requirements based on the element format (RGBA8 vs. RGB8 vs. R32F, etc.).
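That just means indexing rows by the pitch rather than the logical width -- along these lines (a sketch; computeHeight is a hypothetical device function):

code:
__global__ void fillHeightfield(unsigned char* base, size_t pitchBytes,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // step by the pitch so any per-row padding is respected
    float* row = (float*)(base + y * pitchBytes);
    row[x] = computeHeight(x, y);
}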
# May 10, 2010 17:16
OneEightHundred posted:This isn't really true. While min/max are usually built-in, what you just typed out is exactly the kind of instruction emitted in shader models that don't support branching.

This is basically it. Branching comes in two forms: conditional instructions (such as "if (a>b) {a=b}"), which cost nothing extra, and divergent code paths (such as "if (a>b) { a = tex2D(texFoo); } else { b = tex2D(texFoo); }"), which may potentially cost as much as executing both code paths.

In the first case, you are just running the same instructions, and each thread may get different results based on the values in its registers; in that sense, "a = min(a, b)" is no different than "a = a+b". In the second case, you can think of the GPU as processing fragments/threads in "clusters" which all execute together, with a mask for each thread in the cluster saying whether to keep or discard the results of a given instruction. When all the threads in a cluster go down the same path (such as all the fragments generated by a small triangle), the GPU is smart enough to detect this and only execute instructions on that path. If the cluster is split (i.e. it is "divergent"), then it has to issue instructions for both paths, even though only a subset of the threads will actually use the results of each.

So if you've got high-level branching, such as changing material type based on constant buffer values or other low-frequency changes, you won't really see any penalty; if you've got very locally divergent execution patterns, you'll see worst-case performance.
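The two cases side by side, in GLSL (illustrative only; the inputs are hypothetical):

code:
uniform sampler2D texFoo;
uniform sampler2D texBar;
varying float a;
varying float b;

void main()
{
    vec2 uv = gl_TexCoord[0].st;

    // 1) conditional select: predicated, same cost as any other ALU op
    float m = (a > b) ? b : a;  // i.e. min(a, b)

    // 2) divergent paths: if fragments in the same cluster take different
    //    sides, both fetches execute and the results are masked per thread
    vec4 c;
    if (a > b)
        c = texture2D(texFoo, uv);
    else
        c = texture2D(texBar, uv);

    gl_FragColor = c * m;
}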
# May 11, 2010 12:50
YeOldeButchere posted:A little while ago I asked about the performance of branching in shaders -- thanks for the answers, by the way -- but I'd like to know where to find that sort of info myself. I'm guessing this is GPU-dependent and outside the scope of specific APIs, which would explain why I don't recall anything about this in the D3D10 documentation. A quick google search doesn't seem to return anything useful either. I've found some info on NVIDIA's website, but it's mostly about which GPUs support dynamic branching and very little about performance. Ideally I'd like something that goes into some detail about modern GPU architectures so I can really know why and when it's a bad idea to use branching in shaders, preferably with actual performance data shown.

It's going to vary, not just from vendor to vendor but from chip to chip. If you want to test it yourself, you probably want to measure (a) static branches, (b) branches off of constant buffer values, and (c) dynamic branches (either on interpolants or on texture lookups), at varying frequencies. It's hard to test rigorously, though, because it's not obvious how the compiler will translate a given branch into assembly (as a conditional move versus an actual branch).
# May 17, 2010 22:52
passionate dongs posted:Does anyone know what is going on here?

GL_FLAT takes the whole primitive's color from a single provoking vertex, so is it possible that you've got bad normals in one of the other two? The errors look pretty systematic...
# Oct 27, 2010 23:44
UraniumAnchor posted:How well does OpenCL handle loops that don't necessarily run for the same number of repetitions? Would the other threads in the warp just stall while waiting for the longer one to finish?

Yes, that's exactly what would happen (you'd pay a hardware utilization penalty anywhere you have divergence within a warp). However, in the case of a ray-caster, rays within the same warp should be fairly spatially coherent, so adjacent elements will likely follow the same code path and not diverge too badly. You will still see the problem around shape edges, for example, but how much that actually matters depends on your scene complexity. You could try assigning your threads to pixels along a Hilbert curve instead of a naive 2D mapping, which should improve your spatial coherence somewhat.
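For a cheaper approximation of the same idea, a Z-order (Morton) mapping -- not a true Hilbert curve, but most of the locality benefit -- is a couple of lines of bit-twiddling, sketched here in OpenCL C:

code:
// Interleave the bits of a linear work-item id into an (x, y) pixel.
uint2 mortonToXY(uint idx)
{
    uint x = 0, y = 0;
    for (uint bit = 0; bit < 16; ++bit) {
        x |= ((idx >> (2 * bit))     & 1u) << bit;
        y |= ((idx >> (2 * bit + 1)) & 1u) << bit;
    }
    return (uint2)(x, y);
}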
# Dec 6, 2010 15:16
Optimus Prime Ribs posted:Is there ever a justifiable reason to use display lists in OpenGL?

The ARB is trying their darndest to make them disappear, despite their potential convenience in reducing API overhead (mostly because, as they currently exist, they put a lot of constraints on future extensions).
# Mar 22, 2011 04:56
OneEightHundred posted:So, one thing I haven't been keeping up on: How important is throughput now, if at all? That is, 2003-ish, static buffers were becoming hot poo poo to reduce data throughput to the card compared to the previously-favored approach of essentially bulk-copying data from system memory to the GPU every frame.

Bandwidth matters a little, but with constantly updating dynamic buffers, API overhead (in DirectX) usually matters more if you're not strongly GPU-bound. One good trick is to issue all your buffer updates (and other state, for that matter) in a single large block -- this lets the driver's internal tracking/ref-count data stay in caches, and you'd be surprised how much it can improve performance if you are API-bound. Also, make sure you are mapping your buffers as "DISCARD/write-only" so the API knows it can stage the write and doesn't have to wait on any in-flight calls that access that buffer.

In general, though, your instinct is correct. Static buffers are definitely still preferred, but you're not going to notice the downstream PCI-E bandwidth as much.
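The GL equivalent of that DISCARD/write-only mapping (a sketch; dynamicVbo, vertexData, and size are hypothetical):

code:
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
// INVALIDATE_BUFFER orphans the old storage, so the driver hands back
// fresh memory instead of stalling on draws still reading the old data
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, vertexData, size);
glUnmapBuffer(GL_ARRAY_BUFFER);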
# Mar 31, 2011 11:13
OneEightHundred posted:Making card-specific behavior works great when your audience uses one card. That isn't the PC market. Making card-specific behavior there means you're sinking development resources into a fraction of your audience.

Or you're Epic, and you're willing to sink a lot of effort into writing low-level engine code so that you can sell the engine at a markup to a gaggle of independent developers. But no, in general it's not a good idea, because GPUs change hardware much faster than CPUs do right now, and the major vendors are different enough under the hood in important ways that it would be hard to come together on a fully shared ISA.
# Oct 27, 2011 02:31
OneEightHundred posted:Epic would rather spend their time writing code with tangible benefits to as many users as possible too. The fact that it affects more users does not change the fact that, relative to other features they could be implementing, it isn't a very productive use of time.

My point was that the people you hear advocating for it are all developers with middleware to sell -- it would be a HUGE investment of time and effort to implement a flexible, multi-platform low-level engine, but it could also provide performance that would give them a competitive advantage over developers without the resources to implement and maintain such a codebase. Remember back when Mark Rein was out stumping for Larrabee?

But that's neither here nor there, because this will never happen. CUDA/OpenCL with some limited access to fixed-function units is probably as close as you'll ever get.
# Oct 27, 2011 15:25
HolaMundo posted:How should I handle having multiple shaders?

The ideal solution is to use the preprocessor to #ifdef out sections of the code corresponding to different features, pass defines to the compiler as macros, and generate all the permutations you might need. (A sketch of that approach is below.) However, it's a lot simpler (and practically as good) to just place the code in branches and branch on bool parameters from constant buffers. So long as the branches depend only on constant buffer values, you shouldn't see any problem. This is almost as good as using defines on newer hardware; on older hardware (GeForce 7000-era) you might see slightly slower shader loading/compilation, but it almost certainly won't be noticeable unless you're doing lots of streaming content.

zzz posted:I haven't touched GPU stuff in a while, but I was under the impression that static branches based on global uniform variables will be optimized away by all modern compilers/drivers and never get executed on the GPU, so it wouldn't make a significant difference either way...?

Spite posted:You'd hope so, but I wouldn't assume that! The vendors do perform various crazy optimizations based on the data. I've seen a certain vendor attempt to optimize 0.0 passed in as a uniform by recompiling the shader and making that a constant. Doesn't always work so well when those values are part of an array of bone transforms, heh.

This should work, so long as the static branch is on a bool.

Woz My Neg rear end posted:It's almost always preferable to a true conditional to run the extra calculations in all cases and multiply the result by 0 if you don't want it to contribute to the fragment.

This is the opposite of true; do not do this.

Hubis fucked around with this message at 02:31 on Dec 6, 2011
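The permutation approach boils down to prepending defines before compiling (a sketch; useNormalMap and fragSourceText are hypothetical):

code:
#include <string>

std::string src = "#version 120\n";
if (useNormalMap)
    src += "#define USE_NORMAL_MAP 1\n";
src += fragSourceText;  // contains #ifdef USE_NORMAL_MAP ... #endif blocks

GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
const char* p = src.c_str();
glShaderSource(fs, 1, &p, 0);
glCompileShader(fs);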
# Dec 6, 2011 02:18
OneEightHundred posted:It honestly isn't that hard to create a VBO, map it, and write a function that just copies the parameters into it and increments a pointer, and a flush function to call manually to terminate the primitive, or if you try to push too much into the buffer.

Well, the other problem with removing glBegin/glEnd (which I am strongly in favor of) is that you run the risk of having the same problem as DirectX, where even simple examples need 500+ lines of cruft to handle all the resource creation, etc. What would be nice is an updated GLUT that creates a VBO in the background and exposes something like "glutBegin/glutEnd" to the programmer, with the clear understanding that it's just for illustrative purposes.
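Such a wrapper could be as simple as this (purely hypothetical -- nothing like it exists in GLUT; positions only, growing a CPU-side array and flushing it to a streamed VBO on end()):

code:
#include <vector>

struct ImmediateMode {
    std::vector<GLfloat> data;  // interleaved x, y, z
    GLuint vbo;
    GLenum mode;

    ImmediateMode() : vbo(0), mode(GL_TRIANGLES) {}

    void begin(GLenum m) { mode = m; data.clear(); }
    void vertex3f(GLfloat x, GLfloat y, GLfloat z) {
        data.push_back(x); data.push_back(y); data.push_back(z);
    }
    void end() {  // flush: upload and draw everything batched so far
        if (!vbo) glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, data.size() * sizeof(GLfloat),
                     &data[0], GL_STREAM_DRAW);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, 0);
        glDrawArrays(mode, 0, (GLsizei)(data.size() / 3));
        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }
};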
# Jan 22, 2012 15:42
shodanjr_gr posted:Are const arrays inside shader code supposed to be blatantly slow?

What graphics card are you seeing this on? Depending on the hardware, the problem probably isn't the const array itself, but the fact that you are using dynamic indexing into it. Some GPUs don't really support base+offset indexing, instead mimicking it with registers. Unfortunately, if you index the array dynamically, all of the accesses have to be expanded (either by unrolling loops or by expanding into a giant nasty set of branches). So you could actually be gaining MORE branches, instead of eliminating them.

Why do you need to index the edges the way you are doing? Your best bet would be to structure the input to your shader so that it doesn't need to branch at all, even if that means adding extra interpolants. I'm not sure if that would work for you here, though.

e: There's no way to see intermediate assembly with GLSL, right? In DirectX, you could use FXC to dump the shader, which might show whether this is happening at the high level (though not if it's being introduced at the machine-code translation stage).

Hubis fucked around with this message at 14:51 on Mar 20, 2012
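The problem case looks innocuous in GLSL -- something like this (illustrative; edgeId is a hypothetical varying):

code:
#version 120
const vec2 edgeOffset[4] = vec2[4](vec2(-1.0, 0.0), vec2(1.0, 0.0),
                                   vec2(0.0, -1.0), vec2(0.0, 1.0));
uniform sampler2D tex;
varying float edgeId;

void main()
{
    // not known at compile time, so hardware without real base+offset
    // indexing may expand this into a chain of compares/selects or branches
    vec2 offs = edgeOffset[int(edgeId)];
    gl_FragColor = texture2D(tex, gl_TexCoord[0].st + offs);
}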
# Mar 20, 2012 11:54
shodanjr_gr posted:This is on a Quadro 5000.

E: Nevermind, thanks!

Hubis fucked around with this message at 21:50 on Mar 20, 2012
# Mar 20, 2012 20:44