Human Fly 2.1 - DSP implementation schematics ============================================================================ Functional specifics: This implementation is meant mostly for speed, not for flexibility. Hence when the implementation is not upto the task it will return errorcodes as specified in the interface. There are however some pending limitations described in the section further on. The rendering pipeline has the following functions: - X/Y culling: removes all out of the viewcone - Z culling: removes all behind camera - Painters algorithm sorting - Parallel processing stages (see pipeline specs.) - Bitmap cacheing (see limitations) Pipeline specifics: This is built up by two parallel pipelines: the cpu and dsp ones. The main purpose of the cpu line is handling both I/O and housekeeping tasks. The dsp takes care of almost everything related to 3d arithmetic. Both lines operate in parallel most of the time, but ofcourse need a sync each rendering cycle. This is done in two places: 1) when the commands are passed from cpu to dsp 2) when the dsp passes rendered data back to the cpu for it to output Parallel execution is the main reason for the speed of this implementation. CPU | DSP ------------------------------------+--------------------------------------- 0a sending objects, tables (synced)| 0b receiving objects, tables (synced) ....................................|....................................... 1a command sending (synced) | 1b command storage (synced) 2a restoring screenarea | 2b 3d object transformation | 3b painters algorithm sorting 3a output of primitives (synced) | 4b sending primitives to cpu (synced) Pipeline stages explained: 0a: Sends 3d objects and bitmaps to the cpu. Note that the dsp might return an errorcode. This indicates the dsp has not got enough room. The application should send less big objects or textures. Using the rest of the pipeline after an error might result in unpredictable results. 0b: Receives objects and textures from the cpu. There is no parsing of objects. Passing invalid objects might result in unpredicatable behaviour. Take note that the dsp has limited space. 1a: Sending rendering commands to the dsp. This ranges from rotation and translation to depthsorting. 1b: Receives rendering commands and stores them. It is possible to overflow the command-buffer. Beware! 2a: This is ofcourse application-specific. This may use the bounding areas calculated in a previous cycle (see 3a). 2b: Executes stored commands in consecutive order. All vertex/vector transformations are done and also various forms of culling. Take note this stage might overflow dsp buffers! 3a: Receives primitive data from dsp and outputs it to the viewport. Also bounding rectangles are sent and are required to be read by the cpu(!). The whole primitive reception is synchronized. 3b: Sorts the primitive-index-table with a combsorting algorithm. This means less than linear performance! This is in-place and hence does not overflow anything. 4b: This sends primitives and bounding rectangles to the cpu. Polygon- clipping is done on dsp, line- and sprite-clipping on cpu. This stage is synchronized. Limitations: There are numerous. The dsp has little memory, so object and scene complexity are limited. Let's start with the biggest drawbacks: - Viewport width = [1..384] (*) - Viewport height = [1..256] (*) - amount of scene (transformed) objects =< 32 - total size stored untransformed objects =< 8192 words - amount of transformed vertices =< 700 (*) - amount of transformed primitives =< 1000 - total commandlist size (per cycle) =< 512 words Quite some limitations. That is the tradeoff for parallel processing. Make sure neither the commandlist nor the vertex/primitive buffering limits of the dsp are exceeded or this may cause unpredictable results! Since the dsp is responsible for texturing it can cache some small bitmaps for faster rendering. It can also be set to send offsets when it is initialized in stage 0. This way bitmaps are kept on the cpu side. The dsp bitmap storage space: - 2 64*64 highcolor bitmaps - 2 64*64 7bpp bitmaps All may be present in the dsp ram at the same moment. * : You may alter these to finetune the viewportsize<->scene complexity tradeoff. Rendering specifics: Perspectivation uses a multipass inverse-table algorithm. This is not as accurate as a normal division, but is over twice as fast. The error is below 1%. Renderings might appear upto 1% smaller than they should be. Primitives are painted in two manners: a) cpu paints primitive with vertices given by the dsp b) cpu paints a polygon given entirely by the dsp (a) is used for lines, sprites and flat polygons (b) is used for shaded polygons The use of (b) is the speedgain using the dsp for interpolation algorithms. This mode draws n-polys without splitting them into triangles. Hence the texturing-slopes are recalculated with every scanline. Special: Enough (only barely) dsp ram is left to allow a small dsp-loader and interrupt-routine to be present along with audio-mixage functions. I used it with the exa-mixer and it worked fine.