Reply to Reviewer 1

is detailed and includes the main results. The absence of information about salinity is deplored.
We will include a short statement regarding the main salinity results.

Spin-up: There are many references to model stability in the article; however, the supplementary material contains no figure showing the stability of each of the models. Although the recommendations are clear, they are not explained. Why is it recommended to run the simulations again in July 2004 and not in another month? What initialization was used to start the spin-up runs?
The BMIP protocol provides no initial data for the start of the spin-up. As the Baltic Sea has an overturning time of about 30 years, BMIP gives the conservative recommendation of a 1961-2004 spin-up (44 years, i.e. longer than the Baltic Sea overturning time) to reach an equilibrium in which potential drifts are minimized. BMIP recommends starting the production runs in mid-summer because in this season major Baltic salt inflows (MBIs) from the North Sea are extremely unlikely. We will make this clearer in the revised version.
Also implied by these comments is the issue of applying the same spin-up to all models despite differences in grid resolution and turbulence schemes. This is correct. We will include a comment that the models' internal turbulence schemes and resolution may influence the time a model needs to reach equilibrium. This is why we recommend a spin-up duration of at least 44 years (longer than the overturning time), which is a compromise between the computational cost of running the model and the minimization of potential drifts. We will make this clearer in the revised version.
Analysis methods: It may be of interest to indicate the error associated with the post-processing of the AVHRR data.
Thank you for your comment. To address it, we downloaded the raw AVHRR dataset and compared it with the post-processed dataset (Fig. R1). This figure shows that the raw AVHRR dataset underestimates the upwelling frequency by 0.9% but overestimates the spatial variability because it overestimates the frequency in the Bothnian Bay. This result is consistent with the principle of the post-processing, as it unmasks regions misidentified by the cloud detection algorithm. We added the following paragraph to the manuscript to discuss this point: "A comparison between the raw AVHRR dataset and the post-processed dataset reveals an underestimation of the annual upwelling frequency of ~1% (not shown), which is of the same order of magnitude as the models' error. Therefore, it is important to note that, in order to assess the ability of a regional model to simulate coastal upwelling, the choice of the satellite data set is crucial." The method we chose is easy to implement and has been tested and applied many times in the Baltic Sea (e.g. Lehmann et al., 2012; Gurova et al., 2013; Dutheil et al., 2021), in contrast to the suggested methods. Nevertheless, we acknowledge that the suggested method can indeed be used to avoid the bias related to the orientation of the coast. However, in the original study (Abrahams A, Schlegel RW, Smit AJ (2021) A novel approach to quantify metrics of upwelling intensity, frequency, and duration. PLoS ONE 16(7): e0254026. https://doi.org/10.1371/journal.pone.0254026) the suggested method was adapted to the coast of South Africa; it requires the choice of certain thresholds and also includes an evaluation of the wind field. Hence, further investigation and intensive analysis would be necessary to adapt this method to the Baltic Sea, which is beyond the scope of this study. However, we are encouraged to do this work and adapt the method for the Baltic Sea in a follow-up study with a specific focus on upwelling.

Results:
In the introduction of the results, it is stated that different runoffs were used for the HBM model. This part should be in the materials and methods section, explaining the reason for this choice and specifying which runoffs were used.
As HBM is an operational setup, it is straightforward for its implementation to utilize the respective runoff data set for this purpose. Nonetheless, the hydrological dataset is derived from the same source as for the other models, i.e. E-HYPE forecasts. We will include this note in the methods section of the revised version.
In the first part of the results, the role of thermocline formation in the sensitivity of the SST to variations in the meteorological forcing is stated but sparsely discussed. This point lacks discussion and bibliographic references.
We agree. We will add a short paragraph on the role of thermocline formation and add bibliographic references.
The section dealing with seasonality needs to be restructured. Suggestion: discuss the divergences of the models, station by station, with respect to temperature and then do the same for salinity, in the same way as in the introduction to Figure 5. Indeed, the paragraphs introducing the stations sometimes describe the variability of temperature, sometimes that of salinity.

The discussion of temperature variability for the Nemo model is missing.
Thank you for the suggestion. We will think about the structure and revise it accordingly. We will also include NEMO temperature variability in the discussion.
Long-term variability: In this part we still refer to the stability of the models. It is therefore necessary to include the figures that illustrate these remarks in the publication.
Yes, Section 3.4, "Long term variability of temperature and salinity", shows deep-water time series, which are also related to stability. The long-term development of salinity is a good indicator for this. The salinity at the deep stations BY15 and F9 shows that for all models but HBM there are no significant drifts. We will include a remark about this in the revised version.
However, we want to avoid further analysis and the production of new figures on this issue, as this is beyond the scope of the manuscript.

Also, here and at several other places, model divergences are attributed to the models' different handling of their ice modules; what about turbidity, which can limit the heat flux?
It is true that water turbidity, and of course the individual models' light penetration schemes, will also influence the heat fluxes in addition to sea ice. We will make a remark about this in the revised manuscript.
Marine heat waves: Figure 8 with Table 1?
Table 1 is an overview showing the model setup characteristics. Does the reviewer perhaps mean Table 2, which lists yearly mean and maximum surface and bottom temperature trends in the spatial averages over the Baltic Sea? We agree that a comparison with observed extreme values would be very interesting, but to our knowledge no observational data sets exist that would allow the calculation of such long-term trends in spatial averages over the entire Baltic Sea and over such a long time. Thus, this would require additional intensive processing of observational data to allow a reasonable comparison with the models. This work is, however, beyond the scope of our study, which aims to highlight model differences (and thus uncertainty) despite one and the same forcing.

We agree. Due to delays in the production of the simulations, there was an offset between the analysis and the data availability from the respective models. Meanwhile, however, all analyses are complete and we will include MOM_1nm in the upwelling analysis.

Thank you for the comment. We will include a short note on what could be investigated in further studies to elaborate on the timing of thermocline formation, such as vertical turbulence schemes, the momentum transfer from wind into the sea, or different schemes for the penetration of light into the water column. We will also include references for this.

Finally, salinity has once again been little discussed even though it is strongly impacted by runoffs, MBIs, …
That is true. This also reflects the fact that salinity dynamics are very complex in the Baltic Sea. In this first BMIP introduction paper, however, we cannot go too deep into the details. This interesting topic will definitely be taken up in follow-up studies.

Technical corrections
We thank the reviewer for the technical suggestions given below to improve the figures. We will revise the figures accordingly to facilitate the interpretation for the readers. We also thank the reviewer for the corrections to the reference list.
Fig. 3: Use a different color palette for absolute values and differences for better readability.