Dear Guest Editors and Dear Reviewers,
First of all, thank you for the latest round of reviews. We have tried to improve our paper by addressing the questions raised by the reviewers. In this revision, the modified parts of the manuscript are highlighted in a blue font (except for the plot figures, which have all been changed but remain in black). Furthermore, below you can find our considerations and answers to each specific point.
We hope for your favorable consideration for publication.
Best regards,

The authors

Reviewer #1:
-How many such state access patterns do we need? Are the patterns here applicable in large applications?

>> We have further specified the target of our work in the introduction and added clarifying sentences throughout the paper to address this point.

-Table 3 reveals that the options in the access dimension are indeed mutually exclusive. However, the partition dimension doesn't really seem to distinguish very much. Perhaps this because the access ordering issue is a semantic one, whereas the partitioning one seems to be "merely" performance oriented. You might like to comment on this in a final version.

>> We agree with the reviewer. We have added a sentence at page 8 (Sect. 3.7) explaining the point suggested. It should be noted that the partitioning dimension is still useful: although it is not a necessary property for most of the proposed patterns, it is required by the “fully-partitioned” pattern (FPSAP), which is based on the assumption that it is possible to determine, for each input task, the portion of the state that will be accessed. The other parallel pattern with the “ordered access” option (ANSAP) does not require a partitionable state, but it is based on the assumption that not all the input tasks modify the state (which is another semantic property of the computation).
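The assumption behind the fully-partitioned pattern can be illustrated with a minimal sketch (purely illustrative; the function and variable names below are our own, not the paper's API): if a key carried by each input task determines in advance which state partition it touches, tasks can be routed so that each worker owns a disjoint slice of the state and needs no synchronization.

```python
from collections import defaultdict

NUM_WORKERS = 4

def partition_of(task_key, num_workers=NUM_WORKERS):
    """Route a task to a worker based on a hash of its key."""
    return hash(task_key) % num_workers

# Each worker owns a disjoint slice of the state, so no locking is needed:
# two tasks with the same key always land on the same worker.
worker_state = [defaultdict(int) for _ in range(NUM_WORKERS)]

def process(task_key, value):
    w = partition_of(task_key)
    worker_state[w][task_key] += value  # only worker w ever touches this entry
    return w
```

With integer keys, two updates to the same key are applied by the same worker, which is exactly the per-task routing assumption FPSAP relies on.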

-Typos

>> All the suggested typos have been fixed in the revision.

Reviewer #2:

-The biggest issue is the limited extent of the evaluation. In particular, this is done only for smallish values for the parallelism degree (up to 64 cores), which makes it very difficult to extrapolate up to large HPC systems. Also, I would expect an investigation of the performance as a function of the duration of "tasks" to quantify the size of overheads and thus hint at recommended task granularities.

>> We have further specified that we aim at targeting off-the-shelf multicore architectures. All the experiments have been re-run on an Intel Xeon Phi Knights Landing, representing the state of the art in this hardware segment. We now have new graphs with up to 128 threads (roughly the number of available FP vector units on the KNL). Experiments reporting the overheads introduced have also been included.

-Similarly, the authors do not convincingly show that the parallel patterns under consideration are relevant for large-scale applications as those encountered on HPC systems. It should be easy to extend the list of "motivating examples" in Fig 1 with use-cases from HPC workloads.

>> We extended the motivating examples figure with different use cases, more realistic for the framework (hw & sw) considered in this work.

-Section 2: give names to T_s and T_c. What do the _s and _c stand for? Service and computation? Are these definitions ever used in the rest of the paper? In general, many symbols seem to be used without introducing them properly.

>> We have added those symbols in the sentences in which they are introduced for the first time (Sect. 2 at page 2).

-Section 3.2: the scalability estimate ignores the time to compute the predicate P(x_i). Is this justified?

>> We have added a sentence in Sect. 3.2 at page 4.

-Section 3.3: For clarity, introduce the term "hash function" already in the definition section. It is used in the implementation section.

>> Done

-Section 3.5: s_zero is not defined. Is the function g(x_i) in the "implementation" the same as \mathcal{G} in the "definition"?

>> We have corrected the terminology.

-Section 4: "Actual computations are dummy computations only, spending time ...". Can you give the absolute values of these various times? What are the overheads of the patterns if t->0.

>> Added a figure with typical overheads (Fig. 5.b) and an “Overheads” paragraph right before the end of the experiment section.

-Section 4;  Accumulator state access pattern: Can you comment on the case where t_f is similar to t_s?

>> We added sentences in the experiment section. 

-Figure 4: the use of "ideal A/B/C" on the one side and "ideal" on the other is confusing.

>> Fixed

-most figures: consider using log scale for the completion time. Is the completion time for one instance of the pattern or the whole set? The latter is useless unless you give the number of instances or the duration of a single one. Consider normalizing times or putting units on fig 5.

>> Figures changed. Completion times are for a single instance of the pattern.

-Typos

>> All the suggested typos have been fixed in the revision.
