How Exactly Do Graphics Cards Really Work?

In the setup phase, the triangle vertex data streams (x, y, z, color, etc.) are organized for presentation to the rendering engine. Triangles are sorted, culled, and clipped, and edge slopes are calculated for input into the raster engine. Subpixel corrections are needed to avoid anomalies such as poke-throughs and frayed edges. Converting from the floating-point "software" domain to the fixed-point "hardware" domain is also necessary. Doing all this in the host processor burns a lot of CPU cycles.
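The core of that setup arithmetic can be sketched in a few lines: snapping floating-point vertices onto a fixed-point subpixel grid, then deriving the edge slopes the raster engine steps along. This is only an illustration of the math involved, not any particular chip's implementation; the 4-bit subpixel precision is an assumed, illustrative choice.

```python
# Illustrative sketch of triangle setup: fixed-point conversion and
# edge-slope calculation. The 4-bit subpixel precision is an assumption.

SUBPIXEL_BITS = 4
SCALE = 1 << SUBPIXEL_BITS  # 16 subpixel steps per pixel

def to_fixed(v):
    """Convert a float coordinate to fixed-point (integer) subpixels."""
    return int(round(v * SCALE))

def edge_slope(x0, y0, x1, y1):
    """dx/dy slope of an edge, as used by a scanline raster engine to
    step the edge position one scanline at a time."""
    dy = y1 - y0
    if dy == 0:
        return None  # horizontal edge: no x-per-y slope
    return (x1 - x0) / dy

# Snap a triangle to the subpixel grid and derive its edge slopes.
tri = [(10.26, 3.1), (52.75, 3.4), (30.5, 40.9)]
fixed = [(to_fixed(x), to_fixed(y)) for x, y in tri]
slopes = [edge_slope(*fixed[i], *fixed[(i + 1) % 3]) for i in range(3)]
```

Doing this once per triangle in hardware, rather than in the host CPU, is exactly the "setup" assist discussed below.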

Finally, in the rasterization phase, triangles are shaded, sorted, texture mapped, blended, and mapped to the display. Antialiasing and dithering functions are applied to help correct for a number of different artifacts, such as those seen on near-horizontal edges. Quality is not necessarily guaranteed here: some VGCs use simpler scanning techniques that can, for example, result in bleed-through along triangle boundaries.
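The bleed-through problem comes down to tie-breaking: when two triangles share an edge, pixels exactly on that edge must be claimed by exactly one triangle. The sketch below shows one common way to do that, an edge-function test with a top-left fill rule; the winding convention and rule details are illustrative, not a description of any specific VGC.

```python
# Minimal edge-function rasterizer with a top-left fill rule, showing
# how a consistent tie-break keeps shared-edge pixels from being drawn
# twice (overdraw) or zero times (gaps/bleed-through). Illustrative only.

def orient(a, b, p):
    """Signed area test: > 0 when p lies on the interior side of edge
    a->b for our positively wound triangles (y-down screen coords)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def is_top_left(a, b):
    """In this winding, top edges run rightward; left edges run upward."""
    dy, dx = b[1] - a[1], b[0] - a[0]
    return dy < 0 or (dy == 0 and dx > 0)

def covers(tri, px, py):
    """True if pixel center (px, py) belongs to triangle tri."""
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        w = orient(a, b, (px, py))
        if w < 0 or (w == 0 and not is_top_left(a, b)):
            return False
    return True

# Two triangles sharing the diagonal edge (0,0)-(4,4).
tri_a = [(0, 0), (4, 0), (4, 4)]
tri_b = [(0, 0), (4, 4), (0, 4)]

# Every pixel center in the 4x4 square is claimed by exactly one triangle.
hits = [covers(tri_a, x + 0.5, y + 0.5) + covers(tri_b, x + 0.5, y + 0.5)
        for y in range(4) for x in range(4)]
```

Simpler scanning schemes that skip this kind of rule are what produce the boundary artifacts mentioned above.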

System Performance

Let's take a look at how three contemporary systems perform full 3D-geometry and lighting-model simulations using fixed 10-pixel triangles: 3D software running on a 100-MHz Pentium processor with a 2D Windows accelerator, a 3D VGC with 30-Mpixel/s performance, and a hardware-assisted 3D VGC.

For small numbers of triangles, performance is dominated by system copy calls, which occur most often in the case of the software 3D system (which composes the picture in system memory). The 3D VGC with setup shows lower frame rates due to a double-buffer approach synchronized to the monitor frequency. Above about 500 triangles/frame, the rapid falloff in performance of the 3D VGC demonstrates the need for hardware assist. At 10,000 triangles/frame, the 3D hardware is almost irrelevant, and the system is limited by the CPU's geometry and lighting performance.

What does this all mean in practice? Most 3D games available today use about 1000 to 2000 rendered triangles/frame, from a total of about 5000, corresponding to a depth complexity of about three. Both software 3D and 3D-VGC systems struggle to reach 20 frames/s in this scenario, though hardware setup again helps a lot. Even with rich texturing, 1000 to 2000 triangles for the scene or subject doesn't allow for a very realistic game. The next generation of 1-million-triangles/s games will require scene complexities of 10,000 triangles and beyond. In a typical scene breakdown, the larger triangles represent the greatest area. With rich texture mapping, the larger triangles are essential to the gaming experience, but the small triangles are essential for realistic details.

Current software and 3D-VGC systems will be completely CPU bound at a few frames per second for this scene complexity, well below the 30 frames/s required for interactivity. Using a faster CPU, such as a 200-MHz Pentium II, will raise the performance level by a factor of about three, but the resulting 8 frames/s is still too low.
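The arithmetic behind those figures is straightforward. The 27,000-triangles/s geometry rate below is an assumed value, chosen only to be consistent with the "few frames per second" baseline described above:

```python
# Back-of-the-envelope frame-rate arithmetic for a CPU-bound system.
# The 27,000 triangles/s geometry rate is an assumed, illustrative value.

TRIS_PER_FRAME = 10_000    # next-generation scene complexity
TARGET_FPS = 30            # interactivity threshold

cpu_rate = 27_000                      # triangles/s, assumed CPU-bound rate
fps_now = cpu_rate / TRIS_PER_FRAME    # a few frames per second
fps_p2 = fps_now * 3                   # ~3x faster CPU: about 8 frames/s

required_rate = TRIS_PER_FRAME * TARGET_FPS   # 300,000 triangles/s minimum
```

Even the tripled CPU rate falls far short of the geometry throughput an interactive 10,000-triangle scene demands, which is the case for moving geometry and lighting into the VGC.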

Handling Textures

Sooner or later, all textures have to travel from system memory to the VGC. Today, those bandwidths are typically in the tens of megabytes per second, and they are predicted to double each year. The Accelerated Graphics Port (AGP) increases the available bandwidth from 266 Mbytes/s in 1X mode (twice the rate of the PCI bus) to 1 Gbyte/s in 4X mode. But increased bus bandwidth is not enough. As required screen resolutions climb above 800 by 600 pixels and realism demands more sophisticated techniques such as trilinear texture filtering and antialiasing, it will be easy to exceed even the 4X AGP bus bandwidth if full texture mip-maps are stored in system memory.

Fortunately, it's not necessary to store and transmit full texture mip-maps. Compression factors of 10:1 or better are achievable with minimal texture degradation (Fig. 2). Compressed textures are stored in system memory, transparent to the application, and are transmitted compressed to the VGC, which decompresses them on the fly as needed.
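To see the scale of the savings, note that a full mip chain adds one third to a texture's base size (1 + 1/4 + 1/16 + ... = 4/3). A quick sketch, assuming 16-bit texels and the 10:1 compression factor quoted above:

```python
# Sketch of mip-map storage arithmetic. The 16-bit texel format and the
# 10:1 compression factor are the assumptions stated in the text above.

def mip_chain_bytes(width, height, bytes_per_texel=2):
    """Total bytes for a texture plus all of its mip levels down to 1x1."""
    total = 0
    while True:
        total += width * height * bytes_per_texel
        if width == 1 and height == 1:
            break
        width, height = max(1, width // 2), max(1, height // 2)
    return total

full = mip_chain_bytes(256, 256)   # base texture plus full mip chain
compressed = full // 10            # assumed 10:1 compression over the bus
```

At scene texture budgets of tens of megabytes, that order-of-magnitude reduction is the difference between fitting in AGP bandwidth and saturating it.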

Scatter-gather PCI/AGP bus mastering is the next essential feature; it has the potential to double system performance. The VGC's own memory-management unit (MMU) can autonomously fetch texture maps from system memory, without interrupting the CPU to provide the scattered addresses of data blocks.
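Conceptually, the VGC walks a descriptor list of scattered memory blocks and gathers them into one contiguous texture stream on its own. The following is a simplified software model of that idea; the descriptor format is hypothetical:

```python
# Simplified model of scatter-gather bus mastering: the device follows a
# descriptor list of (offset, length) blocks itself, rather than taking a
# CPU interrupt per block. The descriptor format here is hypothetical.

def gather(memory, descriptors):
    """Assemble a contiguous stream from scattered blocks of 'memory'."""
    out = bytearray()
    for offset, length in descriptors:
        out += memory[offset:offset + length]
    return bytes(out)

ram = bytes(range(256))
# A texture scattered across three non-contiguous blocks of memory.
desc_list = [(16, 4), (128, 4), (64, 4)]
texture = gather(ram, desc_list)
```

The performance win comes from removing the CPU from the inner loop: the host builds the descriptor list once, and the VGC's MMU does the rest.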

Advanced systems achieve a further reduction in bandwidth by building a texture cache into the VGC, and by deferring texturing until clipping and visibility processing are performed. Other key features to look for are perspective-correct texture mapping and the ability to handle generated and video textures.

Although texture compression allows next-generation games to run without saturating PCI-bus bandwidths, there's still a need for enhancements such as geometry- and lighting-acceleration hardware. Microprocessor performance is not increasing fast enough and, until 1999, the best we can expect is a 2X improvement in the CPU's floating-point performance. It's reasonable to assume that this improvement will be absorbed by the requirements of 3D applications. Hardware geometry and lighting also helps minimize the bus-bandwidth requirement.

In the meantime, overall system performance can be improved by implementing in hardware some techniques pioneered in software.

2D Can’t Cut It

In a sense, the fastest way to render an object is not to have to render it. To squeeze the maximum performance from a limited 2D system, independent software vendors (ISVs) have long exploited techniques like sprites, level-of-detail (LOD), 3D layering, and affine transforms. Sprites allow relatively complex, active objects to be stored as bitmaps and superimposed on a scene. However, as we become accustomed to more sophisticated games, conventional 2D sprites often look disconnected and unrealistic.

With LOD filtering, ISVs don’t bother to render more distant objects. It’s a sensible compromise because the number of objects within the field of view increases rapidly with distance. However, it often leads to the disturbing effect of trees, buildings, and other objects suddenly popping up out of nowhere.
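A distance-based LOD policy like the one described can be sketched in a few lines. The distance thresholds and cull range below are illustrative values, not figures from any shipping game:

```python
# Sketch of distance-based level-of-detail selection. The threshold
# distances and the cull range are illustrative assumptions.

LOD_DISTANCES = [50.0, 150.0, 400.0]   # full, medium, low detail bands

def select_lod(distance, cull_beyond=1000.0):
    """Return a detail level (0 = full detail) or None to skip the
    object entirely, as with distant trees and buildings."""
    if distance > cull_beyond:
        return None
    for level, limit in enumerate(LOD_DISTANCES):
        if distance <= limit:
            return level
    return len(LOD_DISTANCES)          # lowest-detail tier
```

The hard cutoff at the cull distance is precisely what causes the popping effect: an object crossing that boundary appears or vanishes in a single frame.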

Use of 3D layering allows a more efficient exploitation of system resources. Objects are grouped according to their distance from the viewer. Background objects like mountains and clouds change very slowly and require infrequent updating. Foreground objects, in contrast, need frequent rendering. Combined with an affine warping capability (transformations such as stretching and skewing), 3D scene updates can be even less frequent. For example, an approaching midground group of buildings can be slowly zoomed without rerendering, and needs rerendering only when the perspective changes by more than a certain amount and/or new surfaces become visible.
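An affine warp is just six coefficients: a 2x2 matrix plus a translation. The sketch below shows how a cached midground layer could be zoomed about its center instead of being rerendered; the layer center and zoom factor are illustrative numbers:

```python
# Sketch of an affine warp applied to a cached 3D layer. The layer
# center (160, 120) and zoom factor are illustrative assumptions.

def affine(point, m):
    """Apply y = A*x + t, with m = (a, b, c, d, tx, ty)."""
    a, b, c, d, tx, ty = m
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def zoom_about(cx, cy, s):
    """Affine coefficients that scale by s about the point (cx, cy)."""
    return (s, 0.0, 0.0, s, cx * (1 - s), cy * (1 - s))

# Zoom the layer 1.5x about its center; the center stays fixed while
# the corners move outward, so no rerendering is needed.
m = zoom_about(160, 120, 1.5)
corner = affine((0, 0), m)
```

Because the warp is cheap per pixel and has no dependence on scene complexity, it's a natural fit for dedicated hardware, which is the point made below.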

Because sprites, 3D layering, and affine warping are all well suited to being handled in hardware, these features will soon appear in the next generation of advanced 3D VGCs. The Talisman initiative from Microsoft represents a multimedia reference platform that incorporates other advanced architectural concepts. These include chunking to reduce frame-buffer memory and minimize bus bandwidth, as well as a range of other multimedia functions such as DVD and video-conferencing.

The other main function of the 3D VGC is the bus-mastered handling of video I/O. During the last few years, video has become an increasingly important feature in the mainstream PC market. Desktop video editing has been a niche application for years. On business platforms, the long-predicted "killer application" is video-conferencing. In an intriguing example where games and video merge, a captured video image of the player is inserted into the game, in which he or she quite literally becomes a leading character.

The underlying problem behind most PC-based video capabilities relates to the differing environments of the PC and the television. For historic reasons, PAL/NTSC TVs use an interlaced scanning system running at 50 or 60 fields/s, with each field carrying half the lines of the full frame. PC monitors typically operate in excess of 75 noninterlaced frames/s with 600 or more lines/frame. When analog video is imported into the PC, a number of artifacts must be corrected. Similarly, a PC outputting a PAL/NTSC signal runs into the same problems, but in reverse.

Video Deinterlacing

When importing DVD or video data, the simplest method of producing a full picture is to merge, or weave, two successive odd and even fields. This technique maintains the picture's maximum detail provided there's no motion. But even the slightest object movement causes a disturbing feathering artifact.

The conventional line-doubling solution discards the even fields and repeats each line in the odd field. This approach eliminates the feathering effect, but at the cost of lower vertical resolution. Interpolation helps, but the missing detail is especially noticeable in (near) static pictures. The best approach, found on some VGCs, is to combine both solutions so that resolution is lost only where needed to correct for motion artifacts.
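The combined, motion-adaptive idea can be sketched per pixel: weave where the picture is static, interpolate where motion is detected. The motion test and its threshold below are illustrative assumptions, not a description of any particular VGC's circuit:

```python
# Sketch of motion-adaptive deinterlacing: weave (keep the opposite
# field's pixel) when static, interpolate within the current field when
# motion is detected. The motion metric and threshold are assumptions.

def deinterlace_pixel(woven, prev_woven, above, below, thresh=10):
    """woven: pixel from the opposite field (weave candidate).
    prev_woven: same pixel one frame earlier (for motion detection).
    above/below: this field's vertical neighbors (interpolation)."""
    if abs(woven - prev_woven) > thresh:
        return (above + below) // 2   # motion: interpolate, no feathering
    return woven                      # static: keep full vertical detail
```

Static regions keep full vertical resolution, while moving regions give up a little sharpness to avoid feathering, exactly the trade described above.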

TV Output

Many users will use a large-screen TV as a PC display for cost reasons, more exciting gaming experiences, or for presentation purposes. Making video information that was specifically generated for the PC look good on a TV presents new challenges to both PC and application designers.

A three-line flicker filter is the VGC's most essential TV-out feature; it corrects for the 25- or 30-Hz flickering of (near) horizontal edges. Good horizontal and vertical upscaling (with interpolative filtering) of video data is mandatory. It's also important to correct for the TV's overscanning, which may be acceptable for a movie, but is disastrous when the lost information is a menu item or a scroll bar.
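A three-line flicker filter works by making each output TV line a weighted sum of three adjacent PC lines, so no detail lives on a single line (and thus in a single field). The 1/4-1/2-1/4 kernel below is a common choice, used here as an illustrative assumption:

```python
# Sketch of a three-line flicker filter. The 1/4-1/2-1/4 weighting is an
# assumed (though common) kernel; one pixel column is shown for clarity.

def flicker_filter(lines):
    """Blend each scanline with its neighbors (edges are clamped)."""
    out = []
    for i in range(len(lines)):
        above = lines[max(i - 1, 0)]
        below = lines[min(i + 1, len(lines) - 1)]
        out.append((above + 2 * lines[i] + below) // 4)
    return out

# A one-pixel-high bright line is spread across three lines, so it
# appears in both interlaced fields instead of flickering in one.
filtered = flicker_filter([0, 0, 255, 0, 0])
```

The cost is a slight vertical softening, which is why the filter belongs on the TV output path only, not on the PC monitor path.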

The ability to compose the TV picture independent of the PC monitor can be equally important in making sure the right information is sent to the TV. For example, in a home-movie editing application, the TV output might only be a window on the PC monitor, while on a VCR it will appear as a full-screen image.

Critical Performance Issues

The consumer PC must be equipped for the next generation of 3D games, which will soon be rendering scenes at rates of 1 million triangles/s. The CPU and VGC need to offer a much better-balanced system for handling the critical performance demands of the 3D pipeline. In the near term, VGCs offering features like scatter-gather bus mastering, texture compression and caching, 3D layering, and affine warping off-load the CPU and dramatically enhance system performance.

Most designers agree that full-triangle setup belongs in the VGC, although there's some division of opinion on whether geometry and lighting processing belongs in the CPU or the VGC. Hardware acceleration of geometry and lighting aims to enhance the performance of today's CPU-bound systems. Designers need to trade off the silicon investment in the CPU against that in the multimedia/VGC subsystem on the basis of the relative importance of the various applications.
