/*******************************************************************\
|                        USAGE INSTRUCTIONS                         |
+-------------------------------------------------------------------+


+-------------------------------------------------------------------+
| INTRODUCTION                                                      |
+-------------------------------------------------------------------+

The tool generates random applications from components.
More precisely, these are the inputs and the outputs:

 o Inputs

   - Components to compose the application from.
   - Probability of using each component.
   - Parameters of components to use.

 o Outputs

   - Executable application.
   - Benchmark harness for the application.
      - Measurement of compound performance.
      - Measurements of isolated component operations.
      - Measurements of component operations in compound workload.
   - Architectural description of the application.

The purpose of the tool is to provide testing applications for
performance modeling. The idea is to generate many random
applications and both measure and model them, comparing
the results.


+-------------------------------------------------------------------+
| PREREQUISITES                                                     |
+-------------------------------------------------------------------+

The tool requires R (the R project) to be installed. Additionally, two
R packages must be available. They can be installed by starting R and
running the following commands:

  install.packages("zoo")
  install.packages("gplots")


+-------------------------------------------------------------------+
| CONFIGURATION                                                     |
+-------------------------------------------------------------------+

The tool provides two different sets of components, one for C++ and
the other for Java. Components can be chosen from only one of the
sets, not both at the same time.

The configuration files are placed in the bin/conf/cpp or bin/conf/java
directory. See the provided configuration files for descriptions of
the configuration options if you want to override the defaults. The
configuration is divided into a shared part (shared.conf), which
defines the available modules and the global and default options, and
specific configurations, which include the shared part and override
its defaults. Two example configuration files are provided
(reliable-single-threaded and reliable-multi-threaded).

Since the default architecture generator ("bottom-up") needs to know
at least rough estimates of the durations of individual modules in
order to restrict the generated architectures to those meeting a given
upper bound on expected service time, these estimates have to be
provided for all available modules before generating random
architectures.

To obtain the estimates, an extra configuration file
(measure-expected-durations.conf) is provided. It generates an
architecture that contains each module from the shared.conf file
exactly once. The measure-expected-durations-cpp.sh script automates
generating and executing this architecture and converting the results
into the expected-durations.conf file, which should then be included
in configuration files that use the bottom-up generator.

The expected durations are thus obtained by executing the
measure-expected-durations-cpp.sh script from the bin/ directory. This
step has to be repeated only when the configuration of available
modules changes or when the hardware platform changes.
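
As a rough illustration of why the estimates are needed, the following
C++ sketch checks whether a candidate set of modules stays within an
upper bound on expected service time. The names ExpectedDurations and
fitsServiceTimeBound, and the plain summation of estimates, are
assumptions made for this example; the actual bottom-up generator may
combine the durations differently.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical table of per-module duration estimates (seconds), standing
// in for the contents of expected-durations.conf.
using ExpectedDurations = std::map<std::string, double>;

// Returns true when the summed estimates of the chosen modules stay within
// the bound. A module without an estimate is rejected, mirroring the
// requirement that estimates exist for all available modules.
bool fitsServiceTimeBound(const std::vector<std::string>& modules,
                          const ExpectedDurations& estimates,
                          double upperBoundSeconds) {
    assert(upperBoundSeconds >= 0.0);
    double total = 0.0;
    for (const std::string& name : modules) {
        auto it = estimates.find(name);
        if (it == estimates.end()) return false;  // no estimate available
        total += it->second;
    }
    return total <= upperBoundSeconds;
}
```

A generator built along these lines would discard or shrink a candidate
architecture whenever the check fails.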

To generate an application, run the generate-architecture_cpp.sh or
generate-architecture_java.sh script, depending on the module language
you have chosen.


+-------------------------------------------------------------------+
| C++ MODULES                                                       |
+-------------------------------------------------------------------+

 * Modules of the same class can appear several times in a generated
   architecture, and are thus instantiated multiple times. Also, each
   worker thread has its own set of module instances. No synchronization
   is thus needed unless a module uses global data (see below).

 * Parameter notes:
   Most modules have parameters, which can be declared as constants,
   as a random selection from a given list of alternatives, as a range,
   etc. The random selection can be done at generation time (the
   parameters then become constants at runtime) or at runtime. Some
   parameter values are read at module initialization, others at each
   module invocation.
   Since a module instance in the architecture is initialized only once
   per worker thread (see above), declaring initialization-time
   parameters as randomized at runtime is not recommended: each thread
   would keep the value it drew, introducing a fixed bias between the
   threads, and the number of threads is typically too low to provide
   sufficient randomization.
   Such parameters are marked as INIT below, and a runtime warning is
   issued if they are not constant at runtime.
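
   The difference between INIT and per-invocation parameters can be
   sketched in C++ as follows. ChoiceParam and SketchModule are
   hypothetical names invented for this example and do not correspond
   to classes of the tool; the sketch only shows that an INIT parameter
   is sampled once per instance while other parameters are sampled on
   every invocation.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical parameter that picks one value from a list of alternatives.
class ChoiceParam {
public:
    explicit ChoiceParam(std::vector<int> alternatives)
        : alternatives_(std::move(alternatives)) {
        assert(!alternatives_.empty());
    }
    bool isConstant() const { return alternatives_.size() == 1; }
    int sample(std::mt19937& rng) const {
        std::uniform_int_distribution<std::size_t> pick(0, alternatives_.size() - 1);
        return alternatives_[pick(rng)];
    }
private:
    std::vector<int> alternatives_;
};

// Hypothetical module: bufsize is an INIT parameter, steps is read at
// each invocation.
class SketchModule {
public:
    SketchModule(const ChoiceParam& bufsize, ChoiceParam steps, std::mt19937& rng)
        : buffer_(static_cast<std::size_t>(bufsize.sample(rng))),  // sampled once
          steps_(std::move(steps)),
          initWasRandomized_(!bufsize.isConstant()) {}

    // True when an INIT parameter was not constant, which is the case the
    // runtime warning is about.
    bool initWasRandomized() const { return initWasRandomized_; }
    std::size_t bufferSize() const { return buffer_.size(); }

    // Each invocation draws a fresh value for the per-invocation parameter.
    int invoke(std::mt19937& rng) const { return steps_.sample(rng); }
private:
    std::vector<char> buffer_;
    ChoiceParam steps_;
    bool initWasRandomized_;
};
```

   With one instance per worker thread, a randomized bufsize is drawn
   only as many times as there are threads, which is why constant INIT
   parameters are preferable.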

 o SPEC CPU modules

   The following modules were ported. Note that their execution is
   serialized among multiple threads by one mutex per module instance
   in the architecture. Isolation of global data between module instances
   is achieved at the linker level by symbol duplication and renaming.

   - 401.bzip2
   
     This module was modified to perform initialization and the actual
     work separately. The data is prebuffered in memory during initialization,
     and file operation calls are satisfied from there - this was already
     implemented there by SPEC.
     
     The work performed during invocation is compression, decompression and
     validation; this could possibly be reduced.
     
     Currently the data is global and a mutex is used, but the amount of global
     data seems low, so this could be improved.
     
     Parameters:
     - filename (INIT) - input file for compression
     - inputsize (INIT) - size of the input data for compression, in 128KB
       granularity. The input file is replicated as necessary if smaller.   
     
   - 429.mcf
   
     Separating initialization and work is not trivial in this module.
     However, the kernel time overhead is quite small (2%).
     Since all the input files from the SPEC suite were too big, they were
     trimmed down.
   
   - 444.namd
   - 458.sjeng
   
     This module was modified to perform initial allocation and the actual
     work separately. No large changes were needed as the source was
     already able to process several input configurations in sequence,
     and the cleanup between configurations is quite lightweight.
     By separating allocation and deallocation, kernel overhead during
     the work invocations was eliminated.
     The module uses global data heavily, so the work invocation is
     protected with a mutex.
     
     Parameters:
     - filename (INIT) - input file containing an even number of lines,
       where the first line of each pair defines a chess board configuration
       and the second line defines the maximum depth of thinking
   
   - 462.libquantum
   
   - 470.lbm
   
     This module was modified to perform initialization and the actual
     work separately.
     
     Originally, the module on each invocation allocated a pair of grids,
     initialized them from the input file, performed the calculation,
     and deallocated the grids. This yielded significant time spent
     in the kernel (it is less significant in the SPEC benchmark itself,
     where many steps are performed during the calculation; the number
     of steps per invocation is much smaller in RPG).
     
     Now, the grids are allocated and initialized in the init phase. Each
     work phase invocation performs a given number of calculation steps.
     This eliminates the kernel time overhead. Note that the grid is left
     modified between invocations rather than reinitialized. The code path
     does not seem to depend on the content of the grid, so the invocations
     should have the same performance, which seems to be confirmed by
     isolated measurements. The deallocation is done in the module
     destructor.
     
     Since there appears to be no global data used by the module, the
     mutex could be removed. The problem is that each grid occupies
     200 MB of memory, so the pair takes 400 MB in total. Having a separate
     instance per thread is thus hardly feasible, and the module therefore
     uses class-static data and a mutex.
     
     Parameters:
     - filename (INIT) - input file (defines locations of obstacles)
     - steps - how many steps of simulation to perform in each invocation
     
   - 473.astar
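
   The pattern described for several of the modules above (allocation
   moved to an init phase, work invocations serialized by a mutex over
   class-static data) can be sketched in C++ as follows. SerializedModule
   and its trivial "calculation" are hypothetical stand-ins written for
   this example; they are not taken from the ported SPEC sources.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical module whose state is shared by all instances (class-static),
// as described for modules with global data or a large memory footprint.
class SerializedModule {
public:
    // Init phase: allocate and fill the shared data once, outside the
    // measured work invocations.
    static void init(std::size_t cells) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (grid_.empty()) grid_.assign(cells, 1.0);
    }

    // Work phase: no allocation happens here. The mutex serializes threads
    // that would otherwise race on the shared grid.
    static double work(int steps) {
        assert(steps >= 0);
        std::lock_guard<std::mutex> guard(mutex_);
        double sum = 0.0;
        for (int s = 0; s < steps; ++s) {
            for (double& cell : grid_) {
                cell *= 0.5;   // the grid stays modified for the next call
                sum += cell;
            }
        }
        return sum;
    }

    // Cleanup, done once at the end rather than after every invocation.
    static void destroy() {
        std::lock_guard<std::mutex> guard(mutex_);
        grid_.clear();
        grid_.shrink_to_fit();
    }

private:
    static std::mutex mutex_;
    static std::vector<double> grid_;
};

std::mutex SerializedModule::mutex_;
std::vector<double> SerializedModule::grid_;
```

   Worker threads would call init() once, work() repeatedly, and
   destroy() at shutdown; only one thread at a time runs inside work().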

   The following modules were deemed too large for porting.

   - 400.perlbench
   - 403.gcc
   - 433.milc
   - 435.gromacs
   - 436.cactusADM
   - 445.gobmk
   - 447.dealII
   - 450.soplex
   - 453.povray
   - 456.hmmer
   - 464.h264ref
   - 471.omnetpp
   - 482.sphinx3
   - 483.xalancbmk

   The following modules were in different languages and therefore not ported.

   - 410.bwaves
   - 416.gamess
   - 434.zeusmp
   - 437.leslie3d
   - 454.calculix
   - 459.GemsFDTD
   - 465.tonto
   - 481.wrf

o FFTW module

  Each instance in each thread uses a static buffer (or buffers) for input and output.
  The buffers are allocated and initialized, and the FFTW planning is performed, during
  module initialization. This means that where multiple parameter variants are declared,
  the random selection occurs only once per thread during initialization. Declaring
  multiple variants is therefore discouraged.
  
  - method (INIT) - "in-place" for one buffer, "separate" for two buffers
  - bufsize (INIT) - size of one buffer (bytes)

o LZW compression module (based on ncompress)

  Execution is serialized between threads as in the SPEC CPU modules.
  
o Calculation module

  RIP module interleaving randomized memory traversal with simple calculations.
  The randomized pointer chain is created at module initialization time.  
  
  - blocksize (INIT) - memory size in bytes to be used
  - stepsize (INIT) - granularity of addresses that are accessed in the block (bytes)
  - mem-cycles - number of memory accesses to perform per invocation 
  - comp-cycles - number of calculations to perform per memory access
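
  A minimal C++ sketch of such a module is given below. It uses indices
  instead of raw pointers and builds the chain with Sattolo's algorithm
  so that the chain forms a single cycle; both choices, and the names
  buildChain and runInvocation, are assumptions of this example rather
  than a description of the RIP code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Builds a randomized cyclic chain over blocksize / stepsize slots. Indices
// stand in for pointers. Sattolo's algorithm yields one cycle that visits
// every slot.
std::vector<std::size_t> buildChain(std::size_t blocksize, std::size_t stepsize,
                                    std::mt19937& rng) {
    assert(stepsize > 0 && blocksize >= stepsize);
    const std::size_t slots = blocksize / stepsize;
    std::vector<std::size_t> next(slots);
    for (std::size_t i = 0; i < slots; ++i) next[i] = i;
    for (std::size_t i = slots - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Performs memCycles chain hops with compCycles simple calculations after
// each hop. The returned value depends on all the work, so the compiler
// cannot discard it.
std::uint64_t runInvocation(const std::vector<std::size_t>& chain,
                            std::size_t& position,
                            int memCycles, int compCycles) {
    std::uint64_t acc = 0;
    for (int m = 0; m < memCycles; ++m) {
        position = chain[position];        // dependent memory access
        for (int c = 0; c < compCycles; ++c) {
            acc = acc * 31 + position;     // simple calculation
        }
    }
    return acc;
}
```

  The chain is built once at initialization, while mem-cycles and
  comp-cycles can vary from one invocation to the next.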

o Memspeedmp module

  Multipointer memory walk from RIP. In init(), a memory area is allocated for each pointer;
  the total memory occupied by all areas is given by the memsize parameter.
  
  - access (INIT) - "linear" or "random"
  - memsize (INIT) - total memory size to traverse (bytes)
  - stepsize (INIT) - granularity of distances between accesses (bytes)
  - pointers (INIT) - how many pointers to use
  - write - 1 = perform memory writes along with reads, 0 = no writes
  
o Thrasher module  

  Thrasher from RIP, similar to memspeedmp, but adds nops between dereferences.

  - access (INIT) - "linear" or "random"
  - memsize (INIT) - total memory size to traverse (bytes)
  - stepsize (INIT) - granularity of distances between accesses (bytes)
  - pointers (INIT) - how many pointers to use
  - nopcount (INIT) - number of nops to perform between advances of all pointers
  - maxnopsinfunc (INIT) - maximum number of nops in one function (if more are needed, the function is called multiple times)
  - write - 1 = perform memory writes along with reads, 0 = no writes
   
o Interleave module

  Interleave from RIP. Accesses a memory area through multiple pointers, where each pointer
  is advanced hops times before switching to another pointer.
  
  - blocksize (INIT) - memory size in bytes to be used
  - stepsize (INIT) - granularity of addresses that are accessed in the block (bytes)
  - pointers (INIT) - how many pointers to use
  - hops            - how many times a pointer is advanced before switching to another one
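
  The access pattern can be sketched in C++ as follows. InterleaveSketch
  is a hypothetical class written for this example; the evenly spread
  starting offsets and the linear wrap-around stride are assumptions,
  and offsets stand in for the pointers of the RIP implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walks a memory block with several pointers. Each pointer advances `hops`
// times before the walk switches to the next pointer.
class InterleaveSketch {
public:
    InterleaveSketch(std::size_t blocksize, std::size_t stepsize, std::size_t pointers)
        : block_(blocksize, 1), stepsize_(stepsize), offsets_(pointers) {
        assert(stepsize > 0 && blocksize >= stepsize && pointers > 0);
        // Spread the starting offsets evenly, aligned to stepsize.
        const std::size_t slots = blocksize / stepsize;
        for (std::size_t p = 0; p < pointers; ++p) {
            offsets_[p] = (p * slots / pointers) * stepsize;
        }
    }

    // One invocation: a full round over all pointers. Returns the sum of
    // the bytes read so that the accesses are observable.
    unsigned long invoke(int hops) {
        unsigned long sum = 0;
        for (std::size_t& offset : offsets_) {
            for (int h = 0; h < hops; ++h) {
                sum += block_[offset];
                offset = (offset + stepsize_) % block_.size();
            }
        }
        return sum;
    }

    const std::vector<std::size_t>& offsets() const { return offsets_; }

private:
    std::vector<unsigned char> block_;
    std::size_t stepsize_;
    std::vector<std::size_t> offsets_;
};
```

  Small hops values switch between memory regions after almost every
  access, while large values approach a sequence of single-pointer walks.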

   
+-------------------------------------------------------------------+
| JAVA MODULES                                                      |
+-------------------------------------------------------------------+

 o SPECjvm2008 modules

   The following modules were ported. Note that their execution is
   serialized by one mutex per module (shared by all instances of
   that module).

   - aes
   - compress
   - mpegaudio
   - rsa
   - signverify
   - xml-transform
   - xml-validation

   The following modules were ported without mutex serialization.

   - fft
   - lu
   - monte-carlo
   - serial
   - sor
   - sparse

   The following modules were not ported.

   - compiler - introduces limits on Java compiler versions
   - derby - requires a whole database server
   - sunflow - multithreaded benchmark

+-------------------------------------------------------------------+
|     Copyright (C) Petr Tuma <petr.tuma@d3s.mff.cuni.cz>           |
|                   and other contributing authors of               |
|                   Department of Distributed and Dependable Systems|
|                   Faculty of Mathematics and Physics              |
|                   Charles University, Czech Republic              |
\*******************************************************************/
