This commit introduces a different interface to submit transfers, using
DMA descriptors.
The structure of the DMA descriptor is as follows:
struct dma_desc {
u32 flags,
u32 id,
u64 dest_addr,
u64 src_addr,
u64 next_sg_addr,
u32 y_len,
u32 x_len,
u32 src_stride,
u32 dst_stride,
};
The 'flags' field currently offers two control bits:
- bit 0: if set, the transfer will complete after this last descriptor
is processed, and the DMA core will go back to idle state; if cleared,
the next DMA descriptor pointed to by 'next_sg_addr' will be loaded.
- bit 1: if set, an end-of-transfer interrupt will be raised after the
memory segment pointed to by this descriptor has been transferred.
The 'id' field corresponds to an identifier of the descriptor.
The 'dest_addr' and 'src_addr' contain the destination and source
addresses to use for the transfer, respectively.
The 'x_len' field contains the number of bytes to transfer,
minus one.
The 'y_len', 'src_stride' and 'dst_stride' fields are only useful for
2D transfers, and should be set to zero if 2D transfers are not
required.
To start a transfer, the address of the first DMA descriptor must be
written to register 0x47c and the HWDESC bit of CONTROL register must
be set. The Scatter-Gather transfer is queued similarly to the simple
transfers, by writing 1 in TRANSFER_SUBMIT.
The Scatter-Gather interface has a dedicated AXI-MM bus configured for
read transfers, with its own dedicated clock, which can be asynchronous.
The Scatter-Gather reset is generated by the reset manager to reset the
logic after completing any pending transactions on the bus.
When the Scatter-Gather is enabled during runtime, the legacy cyclic
functionality of the DMA is disabled.
Signed-off-by: Ionut Podgoreanu <ionut.podgoreanu@analog.com>
Due to nets being optimized at IP-level during the no-OOC synthesis flow,
constraints related to req_clk (request clock) were not being applied,
causing the design to not meet timing.
The fix considers the synchronous modes, appending the possible resulting
req_clk's names after the synthesis flow.
Due to grounded signals in the DMA_TYPE_SRC != DMA_TYPE_STREAM_AXI config.,
sync_rewind is removed during synthesis, even so, constraints were
trying to be applied to those nets.
To resolve this, sync_rewind block was moved to inside the generate.
Vivado seems to properly suppress "Empty list" warnings when the circuit does not exist because of a generate rule.
Signed-off-by: Jorge Marques <jorge.marques@analog.com>
Signed-off-by: Ionut Podgoreanu <ionut.podgoreanu@analog.com>
* Added header license for the files that didn't have
* Modified parentheses
* Removed extra spaces at the end of lines
* Fixed parameters list to be each parameter on its line
* Deleted lines after endmodule and consecutive empty lines
* Fixed indentation
Signed-off-by: Iulia Moldovan <iulia.moldovan@analog.com>
On architectures with ports that support cache coherency, the AWCACHE
signal must be set to indicate that transactions are cached. This patch
adds a parameter allowing AWCACHE to be set on an AXI4 destination port.
This patch addresses the following issue:
In 2D mode when consecutive partial transfers occur, and the latter is
very short, will interfere with the completion mechanism of the first
transfer leading to uncompleted segments and unreported partial
transfers.
The DMAC has the requirement that the length of the transfer is aligned to
the widest interface width. E.g. if the widest interface is 256 bit or 32
bytes the length of the transfer needs to be a multiple of 32.
This restriction can be relaxed for the memory mapped interfaces. This is
done by partially ignoring data of a beat from/to the MM interface.
For write access the stb bits are used to mask out bytes that do not
contain valid data.
For read access a full beat is read but part of the data is discarded. This
works fine as long as the read access is side effect free. I.e. this method
should not be used to access data from memory mapped peripherals like a
FIFO.
This means that for example the length alignment requirement of a DMA
configured for a 64-bit memory and a 16-bit streaming interface is now only
2 bytes instead of 8 bytes as before.
Note that the address alignment requirement is not affected by this. The
address still needs to be aligned to the width of the MM interface that it
belongs to.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
FPGAs support different widths for the read and write port of the block
SRAM cells. The DMAC can make use of this feature when the source and
destination interface have a different width to up-size/down-size the data
bus.
Using memory cells with asymmetric port width consumes the same amount of
SRAM cells, but allows to bypass the re-size blocks inside the DMAC that
are otherwise used for up- and down-sizing. This reduces overall resource
usage and can improve timing.
If the ratio between the destination and source port is too larger to be
handled by SRAM alone the SRAM block will be configured to do partial up-
or down-sizing and a resize block will be inserted to take care of the
remaining up-/down-sizing. E.g. if a 256-bit interface is connected to a
32-bit interface the SRAM will be used to do an initial resizing of 256 bit
to 64 bit and a resize block will be used to do the remaining resizing from
64 bit to 32 bit.
Currently this feature is disabled for Intel FPGAs since Quartus does not
properly infer a block RAM with different read and write port widths from
the current ad_asym_mem module. Once that has been resolved support for
asymmetric memories can also be enabled in the DMAC.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
The DMA_LENGTH_ALIGN LSBs of all length For the most part the tools are
able to deduce this using constant propagation.
But this propagation does not work across the asynchronous meta data FIFO
in the burst memory module.
Add a DMA_LENGTH_ALIGN parameter to the burst_memory module which is used
to explicitly keep the LSBs of length registers on the destination side
fixed at 1'b1. This reduces resource use and improves timing by allowing
better constant propagation and unused logic elimination.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
For consistent simulation behavior it is recommended to annotate all source
files with a timescale. Add it to those where it is currently missing.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Data mover/ src axis changes
Request rewind ID if TLAST received during non-last burst
Consume (ignore) descriptors until last segment received
Block descriptors towards destination until last segment received
Request generator changes
Rewind the burst ID if rewind request received
Consume (ignore) descriptors until last segment received
If TLAST happened on last segment replay next transfer (in progress or
completed) with the adjusted ID
Create completion requests for ignored segments
Response generator changes
Track requests
Complete segments which got ignored
Length of partial transfers are stored in a queue for SW reads.
The presence of partial transfer is indicated by a status bit.
The reporting can be enabled by a control bit.
The progress of any transfer can be followed by a debug register.
Drive the descriptor from the source side to destination
so we can abort consecutive transfers in case TLAST asserts.
For AXIS count the length of the burst and pass that value to the
destination instead the programmed one. This is useful when the
streams aborts early by asserting the TLAST. We want to notify the
destination with the right number of beats received.
For FIFO source interface reuse the same logic due the small footprint
even if the stream does not got interrupted in that case.
For MM source interface wire the burst length from the request side to
destination.
Vivado recognises .h files as C header files,
the expected extension for Verilog Header is .vh
This causes issues in simulating block designs since these files
won't be exported for the simulation even if they are
part of the simulation fileset.
This change adds a diagnostic interface to the DMAC core.
The interface exposes internal information about the core,
information which can't be exposed through AXI registers
due the latency and update rate.
Such information is the fullness of the internal buffer.
For this is exposed in bursts and is driven from the destination
clock domain, as this is reflected in its name.
The signal has a fixed size and is dimensioned by
taking in account the supported maximum number of bursts of 128.
Data is gated on the source side interface and not let into the pipeline if
there is no space available inside the store and forward memory.
This means whenever data is let into the pipeline space is available and
backpressure wont be asserted. Remove the backpressure signals altogether
to simplify the design.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Currently the source side of the DMAC can issue requests for up to
2*FIFO_SIZE-1 bursts even though there is only room for FIFO_SIZE bursts in
the store and forward memory.
This can problematic for memory mapped buses. If the data is not read fast
enough from the DMAC back-pressure will propagate through the whole system
memory subsystem and can cause significant performance penalty or even a
deadlock halting the whole system.
To avoid this make sure that not more that than what fits into the
store-and-forward memory is requested by throttling the request ID based
on how much room is available in the memory.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
The second destination side register slice was put in place to provide
additional slack on some of the datapath control signals. It looks as if
this is no longer required for the latest version of the DMA controller.
All timing paths have sufficient margin.
So remove this extra slice register which just takes up resources and adds
pipeline latency.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Currently both the source side and the destination side interfaces employ a
beat counter to identify the last beat in a burst.
The burst memory already has an internal last signal on the destination
side. Exporting it allows the destination side interfaces to use it instead
of having to generate their own signal. This allows to eliminate the beat
counters on the destination side and simplify the data path logic.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Currently the destination side request ID is synchronized response ID from
the source side. This signal is effectively the same as the synchronized
src ID inside the burst memory. The only difference is that they might not
increment in the exact same clock cycle.
Exporting the request ID from the burst memory means we can remove the extra
synchronizer block.
This has the added bonus that the request ID will increment in the same
clock cycle as when the data becomes available from the memory.
This means we can assume that when there is a outstanding burst request
indicated via the ID that data is available from the memory and vice versa
when data is available from the memory that there is a outstanding burst
request.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Currently the DMAC uses a simple FIFO as the store-and-forward buffer. The
FIFO handshaking is beat based whereas the remainder of the DMAC is burst
based. This means that additional control signals have to be combined with
the FIFO handshaking signal to generate the external handshaking signals.
Re-work the store-and-forward buffer to utilize a BRAM that is subdivided
into N segments. Where N is the maximum number of bursts that can be stored
in the buffer and each segment has the size of the maximum burst length.
Each segment stores the data associated with one burst and even when the
burst is shorter than the maximum burst length the next burst will be
stored in the next segment.
The new store-and-forward buffer takes care of generating all the
handshaking signals. This means handshaking is generated in a central place
and does not have to be combined from multiple data-paths simplifying the
overall logic.
The new store-and-forward buffer also takes care of data width up- and
down-sizing in case that the source and sink modules have a different data
width. This tighter integration will allow future enhancements like using
asymmetric memory.
This re-work lays the foundation of future enhancements to the DMA like
support for un-aligned transfers and early transfer abort which would have
been much more difficult to implement with the previous architecture.
In addition it significantly reduces the resource utilization of the
store-and-forward buffer and allows for better timing due to reduced
combinatorial path lengths.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
For the memory-mapped AXI read interface the slave asserts rlast for the
last beat in a burst.
This means we don't have to count the number of beats to know when the
burst is completed but instead can use rlast. This slightly reduces the
amount of resources needed for the MM-AXI source module and given that the
beat_counter is often the bottleneck timing wise this should also improve
the timing.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
The DMAC allows a transfer to be aborted. When a transfer is aborted the
DMAC shuts down as fast as possible while still completing any pending
transactions as required by the protocol specifications of the port. E.g.
for AXI-MM this means to complete all outstanding bursts.
Once the DMAC has entered an idle state a special synchronization signal is
send to all modules. This synchronization signal instructs them to flush
the pipeline and remove any stale data and metadata associated with the
aborted transfer. Once all data has been flushed the DMAC enters the
shutdown state and is ready for the next transfer.
In addition each module has a reset that resets the modules state and is
used at system startup to bring them into a consistent state.
Re-work the shutdown process to instead of flushing the pipeline re-use the
startup reset signal also for shutdown.
To manage the reset signal generation introduce the reset manager module.
It contains a state machine that will assert the reset signals in the
correct order and for the appropriate duration in case of a transfer
shutdown.
The reset signal is asserted in all domains until it has been asserted for
at least 4 clock cycles in the slowest domain. This ensures that the reset
signal is not de-asserted in the faster domains before the slower domains
have had a chance to process the reset signal.
In addition the reset signal is de-asserted in the opposite direction of
the data flow. This ensures that the data sink is ready to receive data
before the data source can start sending data. This simplifies the internal
handshaking.
This approach has multiple advantages.
* Issuing a reset and removing all state takes less time than
explicitly flushing one sample per clock cycle at a time.
* It simplifies the logic in the faster clock domains at the expense of
more complicated logic in the slower control clock domain. This allows
for higher fMax on the data paths.
* Less signals to synchronize from the control domain to the data domains
The implementation of the pause mode has also slightly changed. Pause is
now a simple disable of the data domains. When the transfer is resumed
after a pause the data domains are re-enabled and continue at their
previous state.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
This reverts commit 4b1d9fc86b "axi_dmac: Modified in order to avoid
vivado crash".
Vivado no longer crashes and this structure is much more efficient when it
comes to resource usage and timing. The intention here is to create a 1-bit
memory that is N entries deep and not a N bit signal.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Xilinx tools don't allow to use $clog2() when computing the value of a
localparam, even though it is valid Verilog.
For this reason a parameter was used for BYTES_PER_BURST_WIDTH so far. But
that generates warnings from both Quartus and Vivado since the parameter is
not part of the parameter list.
Fix this by changing it to a localparam and computing the log2() manually.
The upper limit for the burst length is known to be 4k, so values larger
than that don't have to be supported.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
A larger store-and-forward memory provides better protection against worst
case memory interface latencies by being able to store more data before
over-/underflowing.
Based on empirical testing it was found that using a size of 4 bursts can
still result in underflows/overflows under certain conditions. These do not
happen when using a size of 8 bursts.
This change does not significantly increase resource consumption. Both on
Intel and Xilinx the block RAM has a minimum depth of 512 entries. With a
default burst length of 16 beats that allows for up to 32 bursts without
requiring additional block RAM.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
When the source and destination bus widths don't match a resize block is
inserted on the side of the narrower bus. This resize block can contain
partial data.
To ensure that there is no residual partial data is left in the resize
block after a transfer shutdown the resize block is reset when the DMA is
disabled.
Currently this is implemented by tying the reset signal of the resize block
to the enable signal of the DMA. This enable signal is only a indicator
though that the DMA should shutdown. For a proper shutdown outstanding
transactions still need to be completed.
The data that is in the resize block might be required to complete those
transactions. So performing the reset when the enable signal goes low can
lead to a situation where the DMA tries to complete a transaction but can't
do it because the data required to do so has been erased by resetting the
resize block. This leads to a dead lock and the system has to be rebooted
to recover from it.
To solve this use the sync_id signal to reset the resize block. The sync_id
signal will only be asserted when both the destination and source side
module have indicated that they are ready to be reset and there are no more
pending transactions.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
The width of the AXI burst length field depends on the AXI standard
version. For AXI3 the width is 4 bits allowing a maximum burst length of 16
beats, for AXI4 it is 8 bits wide allowing a maximum burst length of 256
beats.
At the moment the width of the length signals are determined by type of the
source AXI interface, even if the source interface type is not AXI. This
means if the source interface is set to AXI3 and the destination interface
is set to AXI4 the internal width of the signal for all interfaces will be
4 bits. This leads to a truncation of the destination bus length field,
which is supposed to be 8 bits.
If burst are generated that are longer than 16 beats the upper bits of the
length signal will be truncated. The result of this will be that the
external AXI slave interface (e.g. the DDR memory) and the internal logic
in the DMA disagree about burst length. The DMA will eventually lock up
when its internal buffers are full.
To avoid this issue have different configuration parameters for the source
and destination interface that configure the AXI bus length field width.
This way one of the interfaces can be configured for AXI3 and the other for
AXI4 without interfering with each other.
Fixes: commit 495d2f3056 ("axi_dmac: Propagate awlen/arlen width through the core")
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Add some limit TLAST support for the streaming AXI source interface. An
asserted TLAST signal marks the end of a packet and the following data beat
is the first beat for the next packet.
Currently the DMAC does not support for completing a transfer before all
requested bytes have been transferred. So the way this limited TLAST
support is implemented is by filling the remainder of the buffer with 0x00.
While the DMAC is busy filling the buffer with zeros back-pressure is
asserted on the external streaming AXI interface by keeping TREADY
de-asserted.
The end of a buffer is marked by a transfer that has the last bit set in
the FLAGS control register.
In the future we might add support for transfer completion before all
requested bytes have been transferred.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
The DMAC currently doesn't support transfers where the length is not a
multiple of the bus width. When generating the wstrb signal we do pretend
though that we do and dynamically generate it based on the LSBs of the
transfer length.
Given that the other parts of the DMA don't support such transfers this is
unnecessary though. So remove it for now and replace it with a constant
expression where wstrb is always fully asserted.
The generated logic for the wstrb signal was quite terrible, so this
improves the timing of the core.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Currently the read side of the src_response interface is not used. This
leads to warnings about signals that have a value assigned but are never
read.
To avoid this just comment out all signals that are related to the
src_response interface for now.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Make sure the req_gen_valid and req_gen_ready signals are declared before
they are used. Strictly speaking the current code is correct and synthesis
correctly, but declaring the signals make the intentions of the code more
explicit.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
All the hdl (verilog and vhdl) source files were updated. If a file did not
have any license, it was added into it. Files, which were generated by
a tool (like Matlab) or were took over from other source (like opencores.org),
were unchanged.
New license looks as follows:
Copyright 2014 - 2017 (c) Analog Devices, Inc. All rights reserved.
Each core or library found in this collection may have its own licensing terms.
The user should keep this in in mind while exploring these cores.
Redistribution and use in source and binary forms,
with or without modification of this file, are permitted under the terms of either
(at the option of the user):
1. The GNU General Public License version 2 as published by the
Free Software Foundation, which can be found in the top level directory, or at:
https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
OR
2. An ADI specific BSD license as noted in the top level directory, or on-line at:
https://github.com/analogdevicesinc/hdl/blob/dev/LICENSE
Currently the AXI address width of the DMA is always 32-bit. But not all
address spaces are so large that they require 32-bit to address all memory.
Extract the size of the address space that the DMA is connected too and
configure reduce the address size to the minimum required to address the
full address space.
This slightly reduces utilization.
If no mapped address space can be found the default of 32 bits is used for
the address.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Depending on whether the core is configured for AXI4 or AXI3 mode the width
of the awlen/arlen signal is either 8 or 4 bit. At the moment this is only
considered in top-level module and all other modules use 8 bit internally.
This causes warnings about truncated signals in AXI3 mode, to resolve this
forward the width of the signal through the core.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>