Register File

The hierarchical register file model exploited in Trace is evaluated against the nonhierarchical one with variations of either register communication bandwidth or register communication latency.

From: Advances in Computers, 2014

Fault Tolerance in Computer Systems—From Circuits to Algorithms*

Shantanu Dutt, ... Fran Hanchek, in The Electrical Engineering Handbook, 2005

8.5.3 Support for Microrollback in Cache Memory

As in the register file, a DWB is used to support microrollback in cache memory. The only modification for a cache, presented by Tamir and Tremblay (1990), is the use of two CAMs instead of one. In this case, each CAM checks only one part of the address, and, thus, the response time of the DWB is smaller. This is done to achieve better performance for cache microrollback support. The cache CAMs' behavior for R/W operations and for rollback is exactly the same as that of the CAM for the register file.

URL:

https://www.sciencedirect.com/science/article/pii/B9780121709600500347

Register-Level Communication in Speculative Chip Multiprocessors

Milan B. Radulović, ... Veljko M. Milutinović, in Advances in Computers, 2014

4.4.5 Scalability Issues

The register-level synchronization and communication support are also analyzed from the scalability point of view. The relevant scalability issues are related to register file organization, register bypass network, and support for register synchronization.

4.4.5.1 Register File Organization

Although the GRF organization (e.g., in MP98 (Merlot) and MAJC) saves bandwidth in extremely small multiprocessor designs like CMPs, it imposes complex wiring, increases the number of read and write ports, and affects the cycle time and latencies. Hence, most speculative CMP designs exploit distributed register structures to alleviate the aforementioned issues noticed in designs with a shared register file organization.

4.4.5.2 Support for Register Communication

The hardware support for register value communication in Multiscalar, Multiplex, and NEKO requires additional register storage banks and sets of bit masks per each core, which results in complex core designs. The support for speculative execution and prediction in SM is completely based on hardware mechanisms, which in turn heavily increases the complexity of the overall design. Atlas uses an on-chip value/control predictor that communicates with the other components through a shared bus, which also impacts area and power efficiency.

Since augmentation of the SS in IACOMA depends on the number of cores, it would not be scalable due to the increase of hardware complexity of LC, register availability, and check-on-store logic, placed per each core. Also, the increase of number of TUs in Mitosis will cause an increase in size of the RVT structure since one of its dimensions is related to the number of TUs.

4.4.5.3 Register Bypass Network

The ring interconnect, as a natural choice for the TLS technique, is exploited for register transfer between cores in Multiscalar, Multiplex, SM, Atlas, NEKO, and Pinot. Although its hardware cost and hop latency are small, it limits the ability to optimize data locality and to perform efficient multiprogramming. Also, the ring suffers from a fundamental scaling problem since it does not scale with the number of nodes. The scalability evaluation of different Atlas architecture layouts of 4, 8, and 16 nodes has shown that an efficient architecture for execution of sequential binaries can be attained. Nevertheless, the increase in number of cores (over 16 cores) results in worse performance (control and data become less predictable), area, and power efficiency than in a uniprocessor [21].

Trace, IACOMA, and Mitosis employ either a set of global buses or a shared bus between processor cores for register-level communication. The overall performance depends not only on the amount of data transferred but also on the bus protocol, the bus design parameters (bus width, priorities, data block size, arbitration handshake overhead, etc.), clock frequency (which depends on complexity of the interface logic, routing of the wires, and placement of the various components), and the components' activity. A simple bus would be sufficient to handle a small number (4–8) of processor cores in a CMP, but more cores or even faster cores would require higher bandwidth, which in turn demands either more buses or a hierarchy of buses. Since communication over a bus can be a costly solution, this is likely a limiting factor for PE complexity and for the overall scalability and performance of those processors.

Finally, the scalability issues in the aforementioned speculative CMP architectures that support register-level communication are summarized in Table 1.8.

Table 1.8. Scalability Issues

CMP | Scalability Issues
Multiscalar [14] | Complex support for register synchronization and communication per each core; ring interconnect for register transfer
Multiplex [8] | Complex support for register synchronization and communication per each core; ring interconnect for register transfer
SM [16] | Hardware mechanisms for speculation and prediction; ring interconnect for register transfer
MP98 (Merlot) [18] | Tightly coupled architecture; global register file, port requirements, long wires
MAJC [17] | Tightly coupled architecture; global register file, port requirements, long wires
Trace [15] | Global register file, port requirements, long wires; global and local result buses
IACOMA [9] | Register availability logic per each core; last-copy control logic per each core; check-on-store logic per each core; shared bus
Atlas [35] | Value/control predictor; shared bus; ring interconnect for register transfer
NEKO [22] | Complex support for register synchronization and communication per each core; ring interconnect for register transfer
Pinot [23] | Ring interconnect for register transfer
Mitosis [19] | RVT structure; ring interconnect for register transfer

URL:

https://www.sciencedirect.com/science/article/pii/B9780124202320000015

Custom Memory Organization and Data Transfer: Architectural Issues and Exploration Methods

Francky Catthoor, ... Arnout Vandecappelle, in The Electrical Engineering Handbook, 2005

2.2.2 Register Files and Local Memory Organization

This subsection discusses the register file and local memory organization. An illustrative organization for a dual port register file with two address busses, where the separate read and write addresses are generated from an address calculation unit (ACU), is shown in Figure 2.4. In this example, two data busses (A and B) are used but only in one direction, so the write and read addresses directly control the port access. In general, the number of different address words can be smaller than the number of port(s) when they are shared (e.g., for either read or write), and the busses can be bidirectional. Additional control signals determine whether to write or read and for which port the address applies. The number of address bits per word is log2(N), where N is the number of locations. The register file of Figure 2.4 can be used very efficiently in the feedback loop of a data path as already illustrated in Figure 2.1. In general, the file is used only for the storage of temporary variables in the application running on the data path (sometimes also referred to as execution unit). Such register files (regfiles) are also used heavily in most modern general-purpose RISCs and especially for modern multimedia-oriented signal processors that have regfiles up to 128 locations. For multimedia-oriented VLIW processors or recent super-scalar processors, regfiles with a very large access bandwidth, up to 17 ports, are provided (Jolly, 1991). Application-specific instruction-set processors (ASIPs) and custom processors make heavy use of regfiles for the same purpose. It should be noted that although it has the clear advantage of very fast access, the number of data words to be stored should be minimized as much as possible due to the power- and area-intensive structure of such register files (both due to the decoder and the cell overhead). Detailed circuit issues will not be discussed here (for review, see Weste and Eshraghian [1993]).
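As a small worked instance of the address-width rule above, the following Python sketch computes the number of address bits for a regfile of N locations. The function name `address_bits` is a hypothetical helper, and rounding up for sizes that are not powers of two is an assumption that the text does not spell out.

```python
def address_bits(num_locations: int) -> int:
    """Address bits needed to select one of num_locations words.

    Equals log2(N) for powers of two; rounds up otherwise (assumption).
    """
    if num_locations < 1:
        raise ValueError("a register file needs at least one location")
    # (N - 1).bit_length() is an exact integer form of ceil(log2(N)).
    return (num_locations - 1).bit_length()


# A 128-location regfile, the upper size mentioned in the text.
bits_for_128 = address_bits(128)
```

For the 128-location regfiles mentioned above this gives 7 address bits per port address.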

FIGURE 2.4. Regfile with Both R and W Addresses

After this brief discussion of the local foreground memories, we will now continue with on- and off-chip background memories of the random access type.

URL:

https://www.sciencedirect.com/science/article/pii/B9780121709600500190

LOWER BOUND ON LATENCY FOR VLIW ASIP DATAPATHS*

Margarida F. Jacome , Gustavo de Veciana , in Readings in Hardware/Software Co-Design, 2002

5 Related work and benchmark examples

In the context of distributed register files, if one wants to consider the deleterious effect of required data object moves on the latency of a schedule, one must explicitly consider a binding of the dataflow nodes to the functional units in the datapath. The basic problem formulated and addressed in this paper is thus different from those considered in [6, 11], for they assume no data transfer delays. However, one can apply these techniques to the dataflow after a binding function has been determined. Indeed, by making each functional unit a distinct resource type with capacity 1, and the bus a resource type with a specific capacity, these methods can also be made binding specific. Given this, one can compare the absolute quality of our lower bound with that reported in [6, 11]. With few exceptions [11] performs better than [6], thus we shall compare our work with an implementation of the algorithm in [11].

Table 1 summarizes our results. Several benchmark dataflows were bound to the datapath shown in Fig. 1. Initial and improved bindings were obtained manually based on the simple heuristics discussed in §4. Columns 2 and 4 of the table show the minimum achievable latency for centralized and for distributed register file structures, respectively. Differences between these indicate the crudeness of assuming a centralized register file structure when it is in fact distributed. Starred entries are known to be optimal latencies over all possible bindings, thus the improvement heuristic was effective.

Table 1. Experimental results.

DFG | Central. RF | Binding | Distrib. RFs | Lower Bds: Our L* | Lower Bds: [11]
FFT Butterfly [3] | 4 | initial | 8 | 8 | 6
 | | imprvd. | 5* | 5 | 4
4th order Avenhous Filter [5] | 7 | initial | 10 | 10 | 9
 | | imprvd. | 9* | 9 | 9
4th order IIR Filter retimed [3] | 4 | initial | 9 | 9 | 8
 | | imprvd. | 6* | 6 | 5
Beamforming Filter (3 beams) [9] | 4 | initial | 8 | 8 | 7
 | | imprvd. | 6* | 6 | 5
AR Filter [2] | 8 | initial | 15 | 13 | 14
 | | imprvd. | 13 | 13 | 13

Our lower bound on latency L*, shown in column 5, was consistently tight and for seven of the ten benchmarks outperformed [11].
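The "seven of the ten" count can be checked directly from the Table 1 entries. The short Python tally below is only a sanity check on the published numbers; the tuple layout and the variable names are choices made here, and "outperforms" is taken to mean a strictly larger (tighter) lower bound.

```python
# (benchmark, binding, distributed-RF latency, our bound L*, bound from [11]),
# copied from Table 1.
ROWS = [
    ("FFT Butterfly", "initial", 8, 8, 6),
    ("FFT Butterfly", "improved", 5, 5, 4),
    ("4th order Avenhous Filter", "initial", 10, 10, 9),
    ("4th order Avenhous Filter", "improved", 9, 9, 9),
    ("4th order IIR Filter retimed", "initial", 9, 9, 8),
    ("4th order IIR Filter retimed", "improved", 6, 6, 5),
    ("Beamforming Filter (3 beams)", "initial", 8, 8, 7),
    ("Beamforming Filter (3 beams)", "improved", 6, 6, 5),
    ("AR Filter", "initial", 15, 13, 14),
    ("AR Filter", "improved", 13, 13, 13),
]

# A larger lower bound is the tighter one.
ours_tighter = sum(1 for _, _, _, ours, other in ROWS if ours > other)

# Rows where L* equals the latency actually achieved with distributed RFs.
ours_exact = sum(1 for _, _, latency, ours, _ in ROWS if ours == latency)
```

With these entries `ours_tighter` is 7, matching the sentence above, and L* equals the achieved latency in 9 of the 10 rows.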

In addition, note that [6, 11] only generate bounds on the earliest possible execution time of individual nodes in the DFG, so the information on serialization (for FUs and buses) that we capture via the WDG is not available. Since the latency of a schedule can vary significantly for different bindings, particularly for datapaths with distributed register files, our approach has a significant added value, in that it can provide guidance on how to change binding functions to achieve lower latencies.

Code generation for VLIW ASIPs has been addressed extensively in the literature; see, e.g., [8, 7]. Although discussing this work is beyond the scope of this paper, to further illustrate the relevance of the trade-off information captured by the WDG, we will briefly discuss the AVIV code generator [4]. This work specifically considers the same trade-offs, while deriving a functional unit binding/assignment for a given expression tree.

As discussed below, AVIV greedily prunes binding alternatives based on a local cost function. Given an expression tree, an ASAP schedule of the expression tree is performed, and nodes (operations) on the resulting levels are sequentially considered (in any order) from the lowest to the highest level. As the operations are considered, a search tree is constructed, representing possible binding alternatives. Heuristically inferior alternatives are immediately pruned based on a local cost function. The cost associated with binding an operation to a functional unit is the sum of 1) the number of required data transfers given the bindings made for the antecedent nodes of that particular path of the decision tree, and 2) the number of operations at the current level that are assigned to the same functional unit, again given the bindings for the antecedent nodes. While this greedy policy would execute faster than our lower bound algorithm, it makes decisions strictly based on local information. Thus, for example, it does not discriminate among operations that have different mobility (i.e., scheduling windows), which can compromise the overall quality of the binding. An iterative improvement algorithm using the WDG can instead create binding alternatives based on a more "global" view of such tradeoffs, at the expense of an increase in runtime. This concludes our discussion of the relevance to code generation of the tradeoffs explicitly modeled in our approach.
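The two-term local cost described above can be sketched in a few lines of Python. This is not AVIV's implementation: the names `local_cost` and `greedy_bind` are hypothetical, the sketch keeps only the single cheapest alternative instead of a pruned search tree, and it assumes that one data transfer is needed for every predecessor bound to a different functional unit.

```python
def local_cost(op, fu, preds, binding, level_ops):
    """Cost of binding `op` to `fu`, given the bindings made so far."""
    # Term 1: operands produced on another functional unit must be moved.
    transfers = sum(1 for p in preds.get(op, ()) if binding[p] != fu)
    # Term 2: operations of the same level already bound to this unit.
    contention = sum(
        1 for other in level_ops if other != op and binding.get(other) == fu
    )
    return transfers + contention


def greedy_bind(levels, preds, fus):
    """Bind operations level by level, always taking the cheapest unit."""
    binding = {}
    for level_ops in levels:  # lowest ASAP level first
        for op in level_ops:
            # min() keeps the first unit on ties, so the order of `fus` matters.
            binding[op] = min(
                fus, key=lambda fu: local_cost(op, fu, preds, binding, level_ops)
            )
    return binding


# Tiny expression tree: c = f(a, b), with a and b on the first ASAP level.
example_preds = {"c": ["a", "b"]}
example_levels = [["a", "b"], ["c"]]
example_binding = greedy_bind(example_levels, example_preds, ["FU0", "FU1"])
```

In this example `a` and `b` are spread over the two units to avoid contention, and `c` then costs one transfer on either unit, which shows how purely local decisions fix the transfer count early.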

URL:

https://www.sciencedirect.com/science/article/pii/B9781558607026500429

Microarchitecture

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

Decode

The second step is to read the register file and decode the instructions. The control unit decodes the instruction, that is, figures out what operation should be performed based on op, funct3, and funct7_5 (bit 5 of funct7). In this state, the processor also reads the source registers, rs1 and rs2, and puts the values read into the A and WriteData nonarchitectural registers. No control signals are necessary for these tasks. Figure 7.30 shows the Decode state in the Main FSM and Figure 7.32 shows the flow through the datapath during this state in medium blue lines. After this step, the processor can differentiate its actions based on the instruction because the instruction has been fetched and decoded. We will first show the remaining steps for lw, and then continue with the steps for the other RISC-V instructions.
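To make the fields named above concrete, the following Python sketch extracts them from a 32-bit RISC-V instruction word. The bit positions follow the standard RV32I base encoding; the function name `decode_fields` and the dictionary layout are illustrative choices and not part of the book's design.

```python
def decode_fields(instr: int) -> dict:
    """Extract the fields the Decode step looks at from a 32-bit word."""
    return {
        "op": instr & 0x7F,               # bits 6:0
        "funct3": (instr >> 12) & 0x7,    # bits 14:12
        "funct7_5": (instr >> 30) & 0x1,  # bit 30, i.e. bit 5 of funct7
        "rs1": (instr >> 15) & 0x1F,      # bits 19:15, first source register
        "rs2": (instr >> 20) & 0x1F,      # bits 24:20, second source register
    }


# sub x1, x2, x3 encodes as 0x403100B3 in RV32I.
sub_fields = decode_fields(0x403100B3)
```

For this `sub` instruction the result has op = 0x33, funct3 = 0, funct7_5 = 1, rs1 = 2, and rs2 = 3. The otherwise identical `add` (0x003100B3) differs only in funct7_5, which is why that single bit is enough for the control unit.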

Figure 7.30. Decode

URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000076

ARM PROCESSOR FUNDAMENTALS

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

2.2.2 BANKED REGISTERS

Figure 2.4 shows all 37 registers in the register file. Of those, 20 registers are hidden from a program at different times. These registers are called banked registers and are identified by the shading in the diagram. They are available only when the processor is in a particular mode; for example, abort mode has banked registers r13_abt, r14_abt and spsr_abt. Banked registers of a particular mode are denoted by an underline character post-fixed to the mode mnemonic or _mode.

Figure 2.4. Complete ARM register set.

Every processor mode except user mode can change mode by writing directly to the mode bits of the cpsr. All processor modes except system mode have a set of associated banked registers that are a subset of the main 16 registers. A banked register maps one-to-one onto a user mode register. If you change processor mode, a banked register from the new mode will replace an existing register.

For example, when the processor is in the interrupt request mode, the instructions you execute still access registers named r13 and r14. However, these registers are the banked registers r13_irq and r14_irq. The user mode registers r13_usr and r14_usr are not affected by the instruction referencing these registers. A program still has normal access to the other registers r0 to r12.
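The register substitution described above can be modeled as a lookup from (register number, mode) to a physical register name. The sketch below is illustrative only: `physical_register` and `BANKED` are hypothetical names, and the per-mode sets (fiq banking r8 to r14, the other exception modes banking r13 and r14) follow the classic ARM register layout that Figure 2.4 depicts rather than anything stated in this excerpt.

```python
# General-purpose registers that each mode replaces with its own copy.
BANKED = {
    "usr": set(),
    "sys": set(),                       # system mode shares the user registers
    "fiq": {8, 9, 10, 11, 12, 13, 14},
    "irq": {13, 14},
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}

# Modes that also have a saved program status register (spsr).
MODES_WITH_SPSR = ("fiq", "irq", "svc", "abt", "und")


def physical_register(number: int, mode: str) -> str:
    """Name of the register actually accessed when code uses r<number>."""
    if not 0 <= number <= 15:
        raise ValueError("register number must be in 0..15")
    if number in BANKED[mode]:
        return f"r{number}_{mode}"
    return f"r{number}"


# 15 banked general-purpose registers plus 5 spsr copies.
total_banked = sum(len(regs) for regs in BANKED.values()) + len(MODES_WITH_SPSR)
```

Under these assumptions `physical_register(13, "irq")` yields `r13_irq` while `physical_register(0, "irq")` stays `r0`, and `total_banked` comes to 20, consistent with the count given above.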

The processor mode can be changed by a program that writes directly to the cpsr (the processor core has to be in privileged mode) or by hardware when the core responds to an exception or interrupt. The following exceptions and interrupts cause a mode change: reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort, and undefined instruction. Exceptions and interrupts suspend the normal execution of sequential instructions and jump to a specific location.

Figure 2.5 illustrates what happens when an interrupt forces a mode change. The figure shows the core changing from user mode to interrupt request mode, which happens when an interrupt request occurs due to an external device raising an interrupt to the processor core. This change causes user registers r13 and r14 to be banked. The user registers are replaced with registers r13_irq and r14_irq, respectively. Note r14_irq contains the return address and r13_irq contains the stack pointer for interrupt request mode.

Figure 2.5. Changing mode on an exception.

Figure 2.5 also shows a new register appearing in interrupt request mode: the saved program status register (spsr), which stores the previous mode's cpsr. You can see in the diagram the cpsr being copied into spsr_irq. To return to user mode, a special return instruction is used that instructs the core to restore the original cpsr from the spsr_irq and bank in the user registers r13 and r14. Note that the spsr can only be modified and read in a privileged mode. There is no spsr available in user mode.

Another important feature to note is that the cpsr is not copied into the spsr when a mode change is forced due to a program writing directly to the cpsr. The saving of the cpsr only occurs when an exception or interrupt is raised.

Figure 2.3 shows that the current active processor mode occupies the five least significant bits of the cpsr. When power is applied to the core, it starts in supervisor mode, which is privileged. Starting in a privileged mode is useful since initialization code can use full access to the cpsr to set up the stacks for each of the other modes.

Table 2.1 lists the various modes and the associated binary patterns. The last column of the table gives the bit patterns that represent each of the processor modes in the cpsr.

Table 2.1. Processor mode.

Mode | Abbreviation | Privileged | Mode[4:0]
Abort | abt | yes | 10111
Fast interrupt request | fiq | yes | 10001
Interrupt request | irq | yes | 10010
Supervisor | svc | yes | 10011
System | sys | yes | 11111
Undefined | und | yes | 11011
User | usr | no | 10000
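As an illustration of how Table 2.1 is used, the following Python sketch reads the mode field out of a cpsr value. The bit patterns are taken from the table; the function name `decode_mode` and the example cpsr value are assumptions made for this sketch.

```python
# Mode[4:0] pattern -> (abbreviation, privileged), from Table 2.1.
MODES = {
    0b10111: ("abt", True),
    0b10001: ("fiq", True),
    0b10010: ("irq", True),
    0b10011: ("svc", True),
    0b11111: ("sys", True),
    0b11011: ("und", True),
    0b10000: ("usr", False),
}


def decode_mode(cpsr: int):
    """Return (abbreviation, privileged) for the mode bits of a cpsr value."""
    mode_bits = cpsr & 0x1F  # the five least significant bits
    if mode_bits not in MODES:
        raise ValueError(f"mode bits {mode_bits:05b} are not listed in Table 2.1")
    return MODES[mode_bits]


# Example value whose low five bits are 10011 (supervisor mode).
example_mode = decode_mode(0x000000D3)
```

Here `example_mode` is `("svc", True)`, the privileged mode in which the core starts after power is applied.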

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500034

Optimizing DSP Software

Robert Oshana , in DSP Software Development Techniques for Embedded and Real-Time Systems, 2006

Trade-offs

The drawback to loop unrolling is that it uses more registers in the register file as well as execution units. Different registers need to be used for each iteration. Once the available registers are used, the processor starts going to the stack to store required data. Going to the off-chip stack is expensive and may wipe out the gains achieved by unrolling the loop in the first place. Loop unrolling should only be used when the operations in a single iteration of the loop do not use all of the available resources of the processor architecture. Check the assembly language output if you are not sure of this. Another drawback is the code size increase. As you can see in Figure 6.24, the unrolled loop, albeit faster, requires more instructions and, therefore, more memory.
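The structure of the trade-off can be seen in a small sketch. Python is used here only to show the shape of the transformation; the register-pressure and code-size effects discussed above arise in compiled DSP code, and both function names are hypothetical.

```python
def dot(a, b):
    """Rolled loop: one accumulator, one multiply-accumulate per iteration."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_unrolled4(a, b):
    """Loop unrolled by four: four independent accumulators stay live."""
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0
    i = 0
    while i + 4 <= n:
        # Each line needs its own accumulator (a register in compiled code).
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    for j in range(i, n):  # leftover iterations when n is not a multiple of 4
        acc0 += a[j] * b[j]
    return acc0 + acc1 + acc2 + acc3


samples = list(range(10))
coeffs = list(range(10, 20))
results_match = dot(samples, coeffs) == dot_unrolled4(samples, coeffs)
```

The unrolled version has four times as many multiply-accumulate statements in its main loop and four live accumulators instead of one, which mirrors the extra instructions and registers described above.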

URL:

https://www.sciencedirect.com/science/article/pii/B9780750677592500089

Synchronization

Ronald Mehler, in Digital Integrated Circuit Design Using Verilog and Systemverilog, 2015

FIFO

The more general solution for a design with an asynchronous bus input is to use a FIFO, a First In, First Out queue.

A common occurrence in digital communications is to have a burst of high-speed data that need to be processed over time. This is illustrated in Figure 7.22, where the Clock 2 timing domain sends a byte of data to Clock 1's domain, where it can be processed at a slower rate.

Figure 7.22. Data burst crossing clock domains

A situation like this cannot be handled by the multiplexor synchronizer illustrated in Figure 7.21. A FIFO is required.

A FIFO provides timing isolation between clock domains. Data may be sent into the FIFO with one clock and read out with a different, asynchronous one. A simple protocol is typically built into each FIFO: if the FIFO is full, more data may not be sent. If the FIFO is empty, nothing may be read from it. Even with a FIFO, the long-term data rates must be identical. More data can never be put into one side than is read out from the other.

In a true FIFO, each data word passes through all unused stages of the FIFO until it either arrives at the output or is blocked by a full stage. This type of FIFO involves the use of cascaded latches and a nonoverlapping two-phase clock. The design is complicated and slow. It is not commonly used anymore. Instead, modern FIFO designs use a memory configured as a circular buffer.

A circular buffer FIFO consists of four subdesigns:

1. Memory, either a RAM or a register file
2. Read pointer
3. Write pointer
4. Control logic and empty/full flag generators.

The top level of a FIFO design of this type is shown in Figure 7.23 [4].

Figure 7.23. FIFO top-level block diagram

With such a design, the speed is virtually independent of the size, unlike a true FIFO, where speed is inversely proportional to depth. A circular buffer FIFO is as deep and wide as the memory cell it uses. The memory cell is also likely to be the component that limits the FIFO's operating speed, so the FIFO in all probability will be as fast as memory technology allows.

The FIFO model developed in this section will use a register file rather than a RAM cell. Although register files have the disadvantages of being larger and more power hungry than memory cells, they have the advantages of being synthesizable without any additional tools and matching perfectly to the needs of a FIFO. They also can be fast. Being made of flipflops, each cell naturally has an input port and an output port, exactly what is needed for a FIFO. While a dual-port memory is often used for FIFOs, a register file simplifies the design. The register file developed in previous chapters uses a single bidirectional data port, but that would not be suitable for a FIFO, as that would require arbitrating access to the port and would make simultaneous reads and writes impossible. Accordingly, the register file will be modified to have a data input port and a data output port.

In most commercial designs, only the smallest FIFOs would use a register file, as the power savings of using a memory cell would override the other concerns for larger applications.

The algorithm for FIFO operation consists of only three procedures:

1. Initialize to empty.
2. Allow write if not full.
3. Allow read if not empty.
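The three procedures can be sketched as a small behavioral model. The Python class below is a single-clock illustration only: it tracks occupancy with a counter, which is an assumption made for brevity and is not how a FIFO that spans two clock domains decides whether it is full or empty. The class and method names are hypothetical.

```python
class CircularFifo:
    """Single-clock behavioral model of a circular buffer FIFO."""

    def __init__(self, depth: int):
        self.depth = depth
        self.mem = [None] * depth  # stands in for the RAM or register file
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0             # procedure 1: initialize to empty

    def is_empty(self) -> bool:
        return self.count == 0

    def is_full(self) -> bool:
        return self.count == self.depth

    def write(self, word) -> bool:
        """Procedure 2: allow write only if not full."""
        if self.is_full():
            return False           # refusing the write avoids overflow
        self.mem[self.wr_ptr] = word
        self.wr_ptr = (self.wr_ptr + 1) % self.depth
        self.count += 1
        return True

    def read(self):
        """Procedure 3: allow read only if not empty; returns (ok, word)."""
        if self.is_empty():
            return (False, None)   # refusing the read avoids underflow
        word = self.mem[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth
        self.count -= 1
        return (True, word)
```

A caller checks the boolean result of `write` and the first element of the tuple returned by `read` before treating the transfer as having happened.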

Reading from an empty FIFO would be an error, as a random data word would be accepted as valid data. Writing to a full FIFO would also be an error, as valid data waiting in the queue would be overwritten by newer data. Reading from an empty FIFO is known as underflow. Writing to a full one is overflow. Both need to be avoided.

Because the objective of the FIFO is to span two clock domains, implementing this algorithm will require operations that use data from the two different domains.

The FIFO will write to and read from every memory location in sequence. This will require a write address and a read address, each pointing to locations in the same memory. Comparisons of the two pointers will be needed to determine if the FIFO is full, empty, or neither.

Comparison of the pointers is the most complicated part of the FIFO design, because one is generated from the read side clock and the other comes from the write side. If the design were synchronous, the comparison would be easy, but if the design were synchronous, the FIFO might not be needed at all.

Empty and full are determined by comparisons of the two pointers. Because the memory is used as a circular buffer, any value of the pointers can mean empty, full, or something in between.

The following sequence of events illustrates the inadequacy of simply comparing pointers to determine if the FIFO is full, empty, or neither empty nor full.

At initialization, the empty flag is set true and the full flag false. Both pointers are set to zero. Since the FIFO is empty, write is the only permissible operation.

After the first write, the write pointer will be at one and the read pointer still at zero. If a read then follows, both pointers will be at 1. The FIFO will be empty again and the empty flag should be set.

Then there are N writes for an N-deep FIFO without any reads. The write pointer will wrap around and will again be at one. The FIFO will be full, so the full flag should be set.

In the absence of any other information, having the pointers equal thus means both full and empty. Since it is not possible for a FIFO to be simultaneously full and empty, something else needs to be added to the design.

So a match between read and write pointers sometimes means the FIFO is empty and sometimes means that it is full. If it is a read operation that causes the pointers to match, then the FIFO is empty. If the write pointer catches up to the read as a result of a write operation, the FIFO is full. Simply comparing without knowing which one changed last is insufficient to determine if the FIFO is empty or full.

Expanding both pointers by one bit can solve this problem. If each pointer has an extra bit and if all bits are equal, then the FIFO is empty. If the most significant bits are opposite but all others are equal, then the FIFO is full.
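The extra-bit rule can be replayed on the sequence of events described above. The Python sketch below uses plain binary pointers in a single clock domain; the function name `fifo_flags` and the 8-deep example size are choices made for this illustration.

```python
def fifo_flags(wr_ptr: int, rd_ptr: int, addr_bits: int):
    """Return (empty, full) for binary pointers that carry one extra bit."""
    addr_mask = (1 << addr_bits) - 1
    wrap_bit = 1 << addr_bits                     # the added MSB
    same_address = (wr_ptr & addr_mask) == (rd_ptr & addr_mask)
    same_wrap = (wr_ptr & wrap_bit) == (rd_ptr & wrap_bit)
    empty = same_address and same_wrap            # all bits equal
    full = same_address and not same_wrap         # only the MSBs differ
    return empty, full


ADDR_BITS = 3                        # 8-deep FIFO, so pointers are 4 bits wide
PTR_MODULUS = 1 << (ADDR_BITS + 1)

wr = rd = 0
at_reset = fifo_flags(wr, rd, ADDR_BITS)

# One write followed by one read: both pointers sit at 1 again.
wr = (wr + 1) % PTR_MODULUS
rd = (rd + 1) % PTR_MODULUS
after_write_then_read = fifo_flags(wr, rd, ADDR_BITS)

# N = 8 writes with no reads: the address bits match again, the MSBs do not.
wr = (wr + 8) % PTR_MODULUS
after_eight_writes = fifo_flags(wr, rd, ADDR_BITS)
```

The first two results are `(True, False)` and the last is `(False, True)`, so the two cases that were indistinguishable with N-bit pointers are now separated.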

A further complication is that the two pointers cannot be directly compared, as they are asynchronous. One is a function of the write clock, the other of the read clock. However, compared they must be, as that is the only way to determine if the device is full or empty and if the flag status needs to be changed. Since the full flag needs to be referenced by the write side and the empty flag by the read side, each pointer must cross over to the other clock domain.

The whole point of the FIFO is to bridge a clock domain. Since now it is apparent that internal to the FIFO there need to be two data buses (the read pointer and the write pointer) crossing clock domains, this would seem to be a geometrically expanding recursive problem, but in reality it ends with these two crossings and the problem is manageable.

Assuming that the counters are at least two bits each, then the data coherence problem illustrated in Figure 7.20 will need to be solved. As an example of how the problem can occur, consider a 4-bit pointer transitioning from seven to eight. In that case, there would be three one-to-zero transitions and one zero-to-one. If the faster one-to-zero transitions are all read with their new values but the zero-to-one transition is caught before reaching its transition threshold, the pointer would be latched in with a new value of zero instead of eight, resulting in an incorrect comparison with the other pointer.
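The hazard can be made explicit by enumerating every value a receiving register could latch if each changing bit is independently seen as either old or new. This Python sketch is an abstraction of the timing problem, not a circuit model, and `possible_samples` is a hypothetical helper name.

```python
from itertools import product


def possible_samples(old: int, new: int, width: int) -> set:
    """Values that may be captured while `old` is changing to `new`."""
    changing = [bit for bit in range(width) if ((old ^ new) >> bit) & 1]
    captured = set()
    for choice in product((False, True), repeat=len(changing)):
        value = old
        for bit, sees_new_value in zip(changing, choice):
            if sees_new_value:
                value ^= 1 << bit  # this bit has already switched
        captured.add(value)
    return captured


# Binary 0111 -> 1000: all four bits change at once.
seven_to_eight = possible_samples(0b0111, 0b1000, 4)

# A transition in which only one bit changes, for comparison.
one_bit_change = possible_samples(0b0100, 0b1100, 4)
```

For the seven-to-eight transition all 16 four-bit values are possible, including the zero from the example above, whereas a transition that changes a single bit can only be captured as the old or the new value.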

In a binary counter, multiple bits can change on every cycle, leading to the problem of a clock domain crossing corrupting data. However, binary is not the only possible counter encoding. Gray counters, which have the distinguishing characteristic of only having a single bit change on any increment or decrement, can be used to cross the clock domain divide.

Gray codes for the values zero through 15 are shown in Table 7.2.

Table 7.2. Decimal, binary, and Gray values

Decimal | Binary | Gray | Decimal | Binary | Gray
0 | 0000 | 0000 | 8 | 1000 | 1100
1 | 0001 | 0001 | 9 | 1001 | 1101
2 | 0010 | 0011 | 10 | 1010 | 1111
3 | 0011 | 0010 | 11 | 1011 | 1110
4 | 0100 | 0110 | 12 | 1100 | 1010
5 | 0101 | 0111 | 13 | 1101 | 1011
6 | 0110 | 0101 | 14 | 1110 | 1001
7 | 0111 | 0100 | 15 | 1111 | 1000
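The Gray column of Table 7.2 can be reproduced with the usual shift-and-XOR form of the binary to Gray conversion. The Python sketch below is an illustration of that conversion, not the book's Verilog function; the names `binary_to_gray` and `bits_changed` are hypothetical.

```python
def binary_to_gray(value: int) -> int:
    """Reflected binary Gray code: each bit is XORed with the next higher bit."""
    return value ^ (value >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Gray column of Table 7.2, as four-bit strings.
gray_table = [format(binary_to_gray(n), "04b") for n in range(16)]

# Every increment, including the wrap from 15 back to 0, changes one bit.
single_bit_steps = all(
    bits_changed(binary_to_gray(n), binary_to_gray((n + 1) % 16)) == 1
    for n in range(16)
)
```

The most significant Gray bit is simply the most significant binary bit, since there is no higher bit to XOR with.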

Gray sequences do have a limitation: There are none for sequences with an odd number of values. Only an even number of values may be represented with a Gray sequence. With a binary counter, it is possible to wrap back to zero after any arbitrary value is reached. With a Gray counter, there is no way to do that for odd numbers and retain the defining characteristic of a Gray sequence, which is that only one bit may change in any given cycle. This effectively limits FIFO depths to even numbers. If a memory cell is used rather than a register file, this is not a serious limitation, as memory cells typically not only always have an even number of locations, they are a power of two deep. A register file can be made any size but for use in a FIFO it must at least have an even number of elements.

Manually coding a Gray counter of arbitrary size is possible but tedious. A simple way of achieving the desired result is by using a binary to Gray encoder, as shown in Figure 7.24. That is a four-bit example, but it can easily be expanded to any size as long as the same structure, including the anomalous most significant bit, is maintained. Code for a binary to Gray encoder was shown in Chapter 5. The conversion could also be done in a function, as shown in Figure 7.25.

Figure 7.24. Four-bit binary to Gray converter

Figure 7.25. A function for binary to Gray conversion and a call to that function

A Gray counter could be created by putting a binary to Gray converter on the output of a binary counter and a Gray to binary converter on the input. A Gray to binary converter is shown schematically in Figure 7.26. The converter is a ripple process, so its delay grows linearly with counter size, or logarithmically with FIFO depth.
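The ripple nature of the Gray to binary conversion, and the counter arrangement just described, can be sketched as follows. This is a Python illustration rather than the Verilog of the figures; `gray_to_binary`, `binary_to_gray`, and `gray_increment` are hypothetical names.

```python
def binary_to_gray(value: int) -> int:
    """Each Gray bit is a binary bit XORed with the next higher binary bit."""
    return value ^ (value >> 1)


def gray_to_binary(gray: int, width: int) -> int:
    """Ripple conversion: each binary bit depends on the bit above it."""
    binary = 0
    running = 0
    for bit in range(width - 1, -1, -1):   # MSB first
        running ^= (gray >> bit) & 1       # b[i] = b[i+1] XOR g[i]
        binary |= running << bit
    return binary


def gray_increment(gray: int, width: int) -> int:
    """Gray counter step: convert to binary, add one, convert back."""
    next_binary = (gray_to_binary(gray, width) + 1) % (1 << width)
    return binary_to_gray(next_binary)


round_trip_ok = all(gray_to_binary(binary_to_gray(n), 4) == n for n in range(16))
```

The loop in `gray_to_binary` runs once per bit and each step needs the previous result, which is the linear delay growth mentioned above.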

Figure 7.26. Gray to binary converter

As will be seen soon, there is no need to use the Gray to binary converter, although its use can be convenient. If a Gray counter is wanted rather than just a Gray conversion of a binary counter and coding one of the appropriate size is too tedious, some synthesis tools will generate one automatically when instructed to do so and given a circuit description coded as a state machine. A function for a Gray to binary converter is shown in Figure 7.27. Although the function appears on both sides of an assignment operator, it is not recursive and is synthesizable.

Figure 7.27. Code for a Gray to binary conversion function

A Gray count can be sent across an asynchronous divide without risk of data coherence corruption, but replacing binary address pointers with Gray would create a new problem.

The issue can be illustrated by the following example using an 8-deep FIFO with 4-bit Gray address pointers. There are eight writes followed by seven reads. With both pointers initialized to zero, this will leave the write pointer with a value of 1100 and the read pointer stopped at 0100. They are equal except for the MSB. According to the empty/full algorithm used with binary pointers, the FIFO is full, but it is not. It has only one word remaining in it.

Even worse, using the three least significant bits of the write pointer as the write address will cause the 9th word to be written to the eighth address rather than the first, thus overwriting data and causing irrecoverable errors. This problem arises because Gray codes are symmetrical about their midpoint rather than repeating from the beginning as binary does. Comparing Gray pointers works for determining that the FIFO is empty but fails for determining if it is full. Further, using Gray codes as an address leads to data loss.
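Both failures can be reproduced numerically. The Python sketch below replays the eight writes and seven reads with Gray-coded pointers and then applies the binary-pointer full rule to them; the helper names are hypothetical and the shift-and-XOR Gray conversion is the standard reflected code.

```python
def to_gray(count: int) -> int:
    """Reflected binary Gray code of a count."""
    return count ^ (count >> 1)


def binary_rule_full(wr_ptr: int, rd_ptr: int, addr_bits: int) -> bool:
    """Full test meant for binary pointers: MSBs differ, all other bits equal."""
    return (wr_ptr ^ rd_ptr) == (1 << addr_bits)


ADDR_BITS = 3                        # 8-deep FIFO with 4-bit pointers
writes, reads = 8, 7

wr_gray = to_gray(writes)            # 1100
rd_gray = to_gray(reads)             # 0100
words_in_fifo = writes - reads       # only one word is actually stored

# The binary rule wrongly reports full when applied to Gray pointers.
false_full = binary_rule_full(wr_gray, rd_gray, ADDR_BITS)

# Using the low three Gray bits as the address: the 9th word (count 8)
# lands on the same location as the 8th word (count 7).
low_mask = (1 << ADDR_BITS) - 1
ninth_word_address = to_gray(8) & low_mask
eighth_word_address = to_gray(7) & low_mask
```

Here `false_full` is true although `words_in_fifo` is 1, and both addresses are 100, which is the overwrite described above.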

A solution to making the sequence repeat from the beginning rather than backtrack about the midpoint would be to replace the second most significant bit with its inverse when the counter is in the second half of its sequence. While this sounds complicated, in practice all it means is replacing that bit with the XOR of itself and the most significant bit.

The count values, Gray coding, and the desired memory addresses are shown in Table 7.3. The circuit for accomplishing this is shown in Figure 7.28. Once the Gray count has been made, the additional cost is only a single XOR gate.

Table 7.3. Gray code and repeating address sequence

Decimal  Gray  Address  Decimal  Gray  Address
0        0000  000      8        1100  000
1        0001  001      9        1101  001
2        0011  011      10       1111  011
3        0010  010      11       1110  010
4        0110  110      12       1010  110
5        0111  111      13       1011  111
6        0101  101      14       1001  101
7        0100  100      15       1000  100

Figure 7.28. Converting Gray sequence to repeating address pointer
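
The single-XOR conversion can be checked against Table 7.3 with a short Python sketch. The function name `gray_to_address` and the default `addr_bits=3` are assumptions chosen to match the 8-deep example.

```python
def binary_to_gray(value: int) -> int:
    return value ^ (value >> 1)


def gray_to_address(gray: int, addr_bits: int = 3) -> int:
    """Drop the MSB and replace the second most significant bit with the
    XOR of itself and the MSB, so the address sequence repeats."""
    msb = (gray >> addr_bits) & 1
    low_bits = gray & ((1 << addr_bits) - 1)
    return low_bits ^ (msb << (addr_bits - 1))


# Regenerate the Decimal / Gray / Address columns for a 4-bit pointer.
for count in range(16):
    gray = binary_to_gray(count)
    print(f"{count:2d}  {gray:04b}  {gray_to_address(gray):03b}")
```

The second half of the printed address column repeats the first half, which is the property the memory address needs.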

With the second to most significant bit inverted half the time, the algorithm for determining if the FIFO is full needs another adjustment. When the FIFO is full, the most significant bits should be opposite. This is unchanged from a binary comparison. However, when the most significant bits are opposite, using the bit inversion scheme illustrated in Figure 7.28, the second to most significant bits of one of the pointers will have been inverted. Thus, when the FIFO is full, the two most significant bits will be opposite and the rest the same.
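
The full rule stated above (two most significant bits opposite, remaining bits equal) can be checked exhaustively for the 4-bit case. This is a minimal sketch that compares the raw Gray pointers; `gray_full` is a hypothetical name.

```python
def binary_to_gray(value: int) -> int:
    return value ^ (value >> 1)


def gray_full(write_gray: int, read_gray: int, addr_bits: int = 3) -> bool:
    """Full when the pointers differ in exactly their two most significant
    bits and agree in every other bit."""
    top_two = 0b11 << (addr_bits - 1)
    return (write_gray ^ read_gray) == top_two


# Compare against the true occupancy for every pair of 4-bit counts.
depth, span = 8, 16
mismatches = [
    (w, r)
    for w in range(span)
    for r in range(span)
    if gray_full(binary_to_gray(w), binary_to_gray(r)) != ((w - r) % span == depth)
]
print("mismatches:", mismatches)
```

An empty mismatch list means the rule agrees with the true occupancy for every pointer pair in this small case.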

This comparison algorithm works, but the modified sequence is no longer a Gray code. Attempting to use the N + 1 bits of a Gray counter with the second to MSB inverted to cross the clock domain boundary would lead to a high failure rate, as there would once again be the potential for catching one bit with a new value and another with an old one when both are transitioning simultaneously.

However, there is no need to use the same bit pattern for addressing the memory and for comparing the pointers to determine the empty/full condition. The two counters simply need to run in lock step. Binary counters can be used for the addresses. Each counter then needs a Gray converter. Gray signals cross the clock domain boundary in each direction. Once synchronized with the other clock, the Gray code can be modified to change the second to most significant bit as shown in Figure 7.28. Using plain binary for the address has the additional advantage of speed. Since the memory is likely to be the limiting factor on the entire FIFO design, any delay in converting the address will be in the critical path. Using plain binary for the address moves the conversion to Gray to be in parallel with the memory access rather than in series with it.
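
A purely behavioral Python sketch of this arrangement follows: binary counters address the memory, and only their Gray conversions are compared for the flags. It assumes a single thread of execution, so synchronizer latency and metastability are not modeled, and the class name `GrayPointerFifo` and its methods are hypothetical.

```python
class GrayPointerFifo:
    """Binary counters address the memory; Gray copies drive the flags."""

    def __init__(self, addr_bits: int = 3):
        self.addr_bits = addr_bits
        self.depth = 1 << addr_bits
        self.ptr_mask = (1 << (addr_bits + 1)) - 1   # N + 1 bit pointers
        self.memory = [None] * self.depth
        self.write_count = 0
        self.read_count = 0

    @staticmethod
    def _gray(value: int) -> int:
        return value ^ (value >> 1)

    @property
    def empty(self) -> bool:
        return self._gray(self.write_count) == self._gray(self.read_count)

    @property
    def full(self) -> bool:
        diff = self._gray(self.write_count) ^ self._gray(self.read_count)
        return diff == 0b11 << (self.addr_bits - 1)

    def write(self, word) -> bool:
        """Store a word; return False if the FIFO is full."""
        if self.full:
            return False
        self.memory[self.write_count & (self.depth - 1)] = word
        self.write_count = (self.write_count + 1) & self.ptr_mask
        return True

    def read(self):
        """Return the oldest word; raise IndexError if the FIFO is empty."""
        if self.empty:
            raise IndexError("read from empty FIFO")
        word = self.memory[self.read_count & (self.depth - 1)]
        self.read_count = (self.read_count + 1) & self.ptr_mask
        return word
```

For example, after eight writes and seven reads, `full` is false and a ninth write lands at address 0 rather than overwriting the unread eighth word.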

A block diagram of an addressing algorithm and full flag logic is shown in Figure 7.29. The two discrete XOR gates would best be incorporated into the full flag comparator design but are shown on the diagram to clarify the intent. In the comparator, the second to MSB of the Gray buses would be replaced by the XOR gate outputs. For reliable operation, a two-stage synchronizer will be needed in the Write Clock Domain on the Gray bus coming from the Read Clock Domain. It is not shown in the diagram, but one is included in Figure 7.30, where the data flow in the reverse direction is illustrated. The memory cell is shown in both diagrams. There is just one memory cell in the design. It is not replicated for both paths.

Figure 7.29. Addressing and full flag logic; synchronizer not shown

Figure 7.30. Empty flag data flow

An alternative algorithm is to convert the N + 1 bit Gray bus back to binary after crossing the clock domain divide. This would take a few more gates, but it would not likely slow the circuit down because it would be done in parallel with the memory access. It has the advantage of making the flags easy to adjust. The FIFO can be designed to signal its status back to the reading and writing devices before it is totally full or empty. For example, the FIFO may be set to signal to the write side when only two words remain free. This feature can be essential in a pipelined operation when the data flow cannot respond instantaneously to a flag.
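
A hedged Python sketch of this alternative: once both pointers are back in binary, occupancy is a modular subtraction and any threshold can be applied. The names `words_used` and `almost_full`, and the default threshold of two free words, are assumptions taken from the example in the text.

```python
def binary_to_gray(value: int) -> int:
    return value ^ (value >> 1)


def gray_to_binary(gray: int, width: int) -> int:
    """Ripple conversion from the MSB down."""
    binary = 0
    bit_above = 0
    for position in range(width - 1, -1, -1):
        bit_above ^= (gray >> position) & 1
        binary |= bit_above << position
    return binary


def words_used(write_gray: int, read_gray: int, addr_bits: int = 3) -> int:
    """Occupancy from the two N + 1 bit Gray pointers."""
    ptr_bits = addr_bits + 1
    write_count = gray_to_binary(write_gray, ptr_bits)
    read_count = gray_to_binary(read_gray, ptr_bits)
    return (write_count - read_count) % (1 << ptr_bits)


def almost_full(write_gray: int, read_gray: int,
                addr_bits: int = 3, free_threshold: int = 2) -> bool:
    """True when free_threshold or fewer words remain free."""
    depth = 1 << addr_bits
    return depth - words_used(write_gray, read_gray, addr_bits) <= free_threshold
```

Changing `free_threshold` adjusts the flag without touching the pointer logic, which is the flexibility the binary conversion buys.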

Because the empty flag is set when the pointers are equal for both binary and Gray codes, no extra processing is needed. If only an empty flag is needed and the receiving side will never need to know if the FIFO is almost empty, the fastest and easiest thing to do is to compare Gray codes. Otherwise, converting the Gray code from the write side back to binary may be the more practical approach.

URL:

https://www.sciencedirect.com/science/article/pii/B978012408059100007X

Digital Building Blocks

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

5.5.5 Register Files

Digital systems often use a number of registers to store temporary variables. This group of registers, called a register file, is usually built as a small, multiported SRAM array because it is more compact than an array of flip-flops. In some register files, a particular entry, such as register 0, is hardwired to always read the value 0 because 0 is a commonly used constant.

Figure 5.49 shows a 32-register × 32-bit three-ported register file built from the three-ported memory of Figure 5.45. The register file has two read ports (A1/RD1 and A2/RD2) and one write port (A3/WD3). The 5-bit addresses, A1, A2, and A3, can each access all 2^5 = 32 registers. So, two registers can be read and one register written simultaneously.

Figure five.49. 32 × 32 register file with two read ports and 1 write port

URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000052

Circuit Methodology

David Harris , in Skew-Tolerant Circuit Design, 2001

4.2.3 Special Structures

In a real system, skew-tolerant domino circuits must interface to special structures such as memories, register files, and programmable logic arrays (PLAs). Precharged structures like register files are indistinguishable in timing from ordinary domino gates. Indeed, standard six-transistor register cells can produce dual-rail outputs suitable for immediate consumption by dual-rail domino gates.

Certain very useful dynamic structures such as wide comparators and dynamic PLAs are inherently nonmonotonic and are conventionally built for high performance using self-timed clocks to signal completion. The problem is that these structures are most efficiently implemented with cascaded wide dynamic gates because the delay of a dynamic NOR structure is only a weak function of the number of inputs. Generally, dynamic gates cannot be directly cascaded. However, if the second dynamic gate waits to evaluate until the first gate has completed evaluation, the inputs to the second gate will be stable and the circuit will compute correctly. The challenge is creating a suitable delay between gates. If the delay is too long, time is wasted. If the delay is too short, the second gate may obtain the wrong result.

A common solution is to locally create a self-timed clock by sensing the completion of a model of the first dynamic gate. For example, Figure 4.21 shows a dynamic NOR-NOR PLA integrated into a skew-tolerant pipeline. The AND plane is illustrated evaluating during ϕ2, and adjacent logic can evaluate in the same or nearby phases. andclk is nominally in phase with ϕ2, but has a delayed falling edge to avoid a precharge race with the OR plane. The latest input X to the AND plane is used by a dummy row to produce a self-timed clock orclk for the OR plane that rises after AND plane output Y has settled. Notice how the falling edge of orclk is not delayed so that when Y precharges high the OR plane will not be corrupted. The output Z of the OR plane is then indistinguishable from any other dynamic output and can be used in subsequent skew-tolerant domino logic.

Figure 4.21. Domino/PLA interface

URL:

https://www.sciencedirect.com/science/article/pii/B9781558606364500046