The Sociolinguistic Sampling. Does it Need to be Redefined?

The present article analyses a classic of the methodology for analysing the social variation of languages: the application of the ratio of 0.0025% to obtain a representative sample of the population of a speech community. This ratio, established empirically by Labov in 1966 for New York City, nevertheless presents important limitations when applied to communities with smaller populations. By replicating the empirical experiment in four Spanish populations of different demographic size, it is shown that the empirically representative samples correspond to the confidence intervals already provided by general statistics. Likewise, it is shown that these were the parameters within which the 0.0025% ratio was developed in New York City. Consequently, the problem lay not in Labov's (1966) formulation of the ratio, but in its subsequent indiscriminate application.


Introduction
The present research critically reviews the sampling criteria employed by variationist sociolinguistics for practically the last fifty years. These criteria have undoubtedly sustained a great deal of empirical work. Multiple speech communities around the world have been studied, in principle, on a solid empirical foundation. In comparison with the synchronic linguistics immediately preceding it, variationist sociolinguistics came much closer to the demographic profiles of the areas it analysed.
For this purpose, it used a common and universal procedure, applied uniformly to any kind of community. This way of proceeding implied an evident contradiction. In general statistics, representativeness is directly linked to the size of the reference population. This means that, by definition, communities of different demographic sizes must follow different representativeness ratios. Variationist sociolinguistics therefore introduced what is, at the very least, a counterexample to general statistics. In view of this, two options should be considered: A) Sociolinguistic reality is exceptional and, therefore, requires statistical procedures different from the usual ones. B) Variationist sociolinguistics has overextended a uniform criterion that, in the end, was not adapted to the reality of the speech communities it studied.
This research tries to resolve this dilemma by replicating the methodology used by Labov in New York and evaluating the results against general statistical procedures.

The Ratio of Representativeness in Variationist Sociolinguistics. Relevant Scholarship
In 1966 Labov established one of the great landmarks of 20th-century linguistics. Most immediately, it was a landmark for sociolinguistics, and especially for the analysis of linguistic variation. His study of the social stratification of New York English has been the benchmark on which such research has been based in virtually all parts of the world. But it was also a landmark for linguistics as a whole, insofar as it meant a radical and far-reaching change in both theory and methodology. In theory, it introduced a model that gave priority to the interrelationship between language and society as the epistemological core of linguistics. This implied, among other things, overcoming the classificatory vision of structuralism and the transcendentalism of generativism, which was almost newly introduced at that time. From the beginning, moreover, Labov's option proved to be a valid solution to the great issues that had preoccupied linguistics since time immemorial. Sociolinguistic variationism, the model directly inspired by Labov, immediately tackled the diversification of languages, change, their historical evolution and the skills needed to use human language. Later, this version of sociolinguistics became increasingly active in applied linguistics, starting with the teaching of foreign languages (Taronne, 1979, 1982, 1988; Ellis, 1982, 1984, 1985) and ending with translation (Caprara, Ortega Arjonilla & Villena, 2016). Much of the immediate success of variationist sociolinguistics was due to the substantial methodological transformation it introduced in empirical research. Dialectology, its immediate reference within linguistics, had operated with reduced sets of speakers that it considered idiosyncratic. It applied mainly geographical criteria, to which it sometimes added other observations, never in a systematic and stable way.
In opposition to this, sociolinguists operated on the basis of representative samples of social reality, filtered through three axes of conditioning on linguistic behaviour (linguistic, social and stylistic factors). This allowed them to access a deeper and more complex dimension of linguistic reality, about which a much greater and more precise knowledge could be obtained.
From the beginning, variationist sociolinguistics made a special effort to refine a very precise and strict methodology that would allow it to fulfil this objective. That is why it established equivalence sets to analyse its variables, discriminated the levels of influence of social factors and operated with representative samples of the communities it studied.
On this last aspect, it was decisive to specify the minimum number of speakers from which a sample could be considered reliable. This was another of Labov's great contributions: a ratio of 0.0025% of the reference population, which was held to guarantee representativeness fully. López Morales (1994) emphasized the extraordinary empirical soundness of Labov's proposal, which was based on an examination of the data (Labov 1966): these were stable down to that point, below which they began to deviate and, therefore, to become unreliable.
The ratio of 0.0025% became a true methodological postulate, completely undisputed and consistently applied in all research that has followed the Labovian perspective. It proved its worth by yielding sociolinguistic samples of enormous relevance, thus making an extraordinary contribution to the development of a much more rigorous empirical linguistics. It has undoubtedly been a decisive reference, thanks to which substantial progress has been made in linguistic research, especially at the empirical level.
This does not mean that variationist sociolinguistics has used the most appropriate criteria to collect its demographic samples. In theory, 0.0025% guaranteed efficient sampling, regardless of the size of the population to which it was applied. In practice, however, it produced very uneven results. Despite this, the sociolinguistic literature has not questioned the universal validity of the ratio established by Labov.

Methodology
To replicate the methodology used in Labov (1966), four Spanish-speaking communities of very different demographic size have been selected: Pedro Martínez (Granada, 1134 inhabitants), Motril (Granada, 58020), Almería (189680) and Barcelona (5.5 M), according to the last statistical census in force in Spain (2018). In each of them, one social factor has been selected through which to measure the stratification of the linguistic variables studied. The aim is to check at what level of representativeness the results begin to deviate. Labov (1966) showed in New York that they did not vary until the sample fell to 0.0025%. With larger samples (0.5%, 1%, etc.) the sociolinguistic distribution was stable and offered the same results. Therefore, here we intend to check whether the same empirical methodology reaches the same conclusions.
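The checking procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in the study: the function name `stability_threshold`, the tolerance of 0.02 and the toy figures are all hypothetical and were introduced here for the example.

```python
def stability_threshold(proportions_by_size, tolerance=0.02):
    """Return the smallest sample size down to which results stay stable.

    proportions_by_size maps a sample size (number of speakers) to the
    proportion of one variant observed in a sample of that size.
    The largest sample is taken as the reference value (an assumption).
    """
    sizes = sorted(proportions_by_size, reverse=True)
    reference = proportions_by_size[sizes[0]]
    threshold = sizes[0]
    for size in sizes:
        if abs(proportions_by_size[size] - reference) > tolerance:
            break  # first deviating sample: keep the previous size
        threshold = size
    return threshold


# Purely illustrative figures, NOT data from the study.
toy_results = {100: 0.40, 50: 0.41, 25: 0.40, 16: 0.39, 15: 0.31, 10: 0.55}
print(stability_threshold(toy_results))
```

The threshold is the last sample size, going downwards, before the first result that departs from the reference by more than the chosen tolerance.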
Specifically, the linguistic variables and social factors listed in Table 1 have been examined. Table 2 summarizes the results of a possible application of the 0.0025% ratio to a very diverse set of populations. As can easily be observed, some hypothetical samples would be impossible (0.01 persons in Latour de Carol), others would be irrelevant (2.14 in Flensburg) or unacceptable (14 in Málaga) and, finally, one could work with reasonable samples only above one million inhabitants (Paris, Moscow…); that is, in contexts similar to New York. As noted, Table 2 shows the serious difficulties involved in the indiscriminate application of the 0.0025% ratio. In effect, up to group F (more than 1M inhabitants) the samples that would be obtained would not be qualitatively acceptable. Of course, in cities with fewer than 10,000 inhabitants it would be impossible to have even one speaker. In practice, these limitations have been resolved by using much higher percentages of representativeness (García Marcos, 1989, 2020), for obvious reasons. Thus, at least in a specific type of speech community, this proportion has not been maintained in studies of sociolinguistic variation.
In the specific case analysed here, three of the four speech communities examined would face serious difficulties if the 0.0025% ratio were applied. With the exception of Barcelona, none of them would be able to establish a minimum sample with which to work. The procedure used here therefore tries to establish empirically what the relevant ratio is in each case.
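The arithmetic behind this difficulty can be reproduced with a short sketch. The populations are those given in the Methodology; the helper name `sample_size` is introduced here for illustration, and Barcelona's "5.5 M" is taken as 5,500,000.

```python
LABOV_RATIO_PERCENT = 0.0025

# Populations as stated in the text (2018 census figures).
populations = {
    "Pedro Martínez": 1134,
    "Motril": 58020,
    "Almería": 189680,
    "Barcelona": 5_500_000,  # "5.5 M" in the text
}


def sample_size(population, ratio_percent=LABOV_RATIO_PERCENT):
    """Number of speakers obtained by applying a percentage ratio."""
    return population * ratio_percent / 100


for name, population in populations.items():
    print(f"{name}: {sample_size(population):.2f} speakers")
```

Only Barcelona yields a sample of more than a handful of speakers; Pedro Martínez does not reach a single one.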

The Case of /t͡ʃ/ in Pedro Martínez (Granada)
First, we examine a small rural community, Pedro Martínez, located in the province of Granada. Here the phoneme /t͡ʃ/ ("CH", in orthographic notation) is analysed. This phoneme has a very low frequency of occurrence in Spanish. It is the penultimate in all counts from Zipf and Rogers (1939) to Perez (2003), with an average occurrence of no more than 0.34%. Despite this, it has shown significant variability in the Spanish language, both in Europe and in America.
The same has been true of Andalusia, the socio-cultural environment that includes Pedro Martínez and the province of Granada. The Linguistic and Ethnographic Atlas of Andalusia (ALEA, from 1959), a reference work in the Spanish dialectology of the time, documented considerable diversification in its domain. Noteworthy is the appearance of the fricative allophone [ʃ], present in practically all of Andalusia, although somewhat irregularly. The province of Granada was no exception. Fricative realizations were documented in the Costa de Granada region, in the capital and in its metropolitan area. In the interior of the province, the ALEA detected another variant of /t͡ʃ/, a cacuminal realization that the surveyor of the atlas (G. Salvador) noted in Iznalloz, the main town of the Montes Orientales region, to which the town of Pedro Martínez belongs.
This description of the dialectal situation in the area has remained valid until practically the present day. However, even an impressionistic approach to the linguistic reality of the area raises certain doubts, since other variants could exist that were not detected by the dialectologists. The suspicion was confirmed in García Marcos (2019a), when the spectrographic results attested the presence of an unexpected variant, a clear [tj], closer to the adherent realizations of this phoneme that have appeared in the Canary Islands.
The distribution in apparent time of the variants of /t͡ʃ/ in Pedro Martínez indicated that [tj] was at that time in marked regression, although it must once have spread, if not completely, at least considerably across the whole sociolinguistic spectrum of the locality. At least this is what the generational spectrum suggested, since it showed the non-normative variants among older speakers. Thus, in 2019 the sociolinguistic distribution of [tj] in Pedro Martínez was mainly conditioned by three social factors:
1. Age, which described the process of generational regression just discussed.
2. Academic background, which distributed the adherent variant along a descending curve from speakers with a lower level of education (higher rates of the variant) to university speakers (residual appearance of [tj]).
3. The social network in which the speakers were located. Endogamous networks (the dense ones in the sociolinguistic literature) favoured the appearance of [tj]. Conversely, exogamous networks (the diffuse ones) brought a marked increase in normative [t͡ʃ].
Thus, for the study considered here, a fairly precise characterization was available of the sociolinguistic dynamics within which the alternation of [tj] and [t͡ʃ] was developing in the Spanish of Pedro Martínez. The most reasonable course was therefore to maintain the same equivalence set used in García Marcos (2019a), which basically considered the two documented variants. In addition, to sharpen the precision of the experiment, one of the three factors mentioned above was selected. Specifically, the social network was chosen, as it contained the most internal variability. Age and schooling were very much in the foreground in its internal stratification, so it seemed preferable to opt for the third of the social determinants. Thus, the distribution of these two variants was examined, both in their overall results and in their sociolinguistic distribution among speakers with dense or diffuse social networks.
Following these criteria and conditions, the tabulation of the corpus of García Marcos (2019a) has yielded the results shown in Table 4. As can easily be seen, the point at which the data begin to deviate is not 0.0025%, but 1.41%, which corresponds to a minimum sample of 16 speakers. Below that figure the data are completely distorted. Moreover, the sample behaves homogeneously in all its components. The figures remain constant down to 1.41% in terms of the general distribution of variants in the speech community, but also when the intervention of its social factors is observed. In fact, the distance between endogamous and exogamous speakers is stable exactly down to that point, so this seems to be a common feature of the whole sample, regardless of the magnitude being measured.

[s-θ] in Motril (Granada)
For the second point of analysis, the Granada town of Motril, one of the most constant processes of variation and change in the history of Spanish has been examined, through a case study that is especially distinctive in Andalusia. The transition from medieval Spanish to that of the Golden Age meant, among other things, a complete readjustment of the sibilant system. As a result, three variants competed for the same position in the new consonant system: the distinction [s-θ] and two neutralisations, one in favour of [s] (seseo) and another in favour of [θ] (ceceo). Normative Spanish opted for the first, although seseo prevailed in America, in part of the peninsular South and in the Canary Islands. To these three solutions was added another variant, [h], much more sporadic and dialectally localised, although present in some parts of Andalusia (Alonso, 1951; Lapesa, 1962; Llorente, 1962).
The Costa Granadina, the region of which Motril is part, documents the four variants (García Marcos, 1989, 2020) that are included here. Sociolinguistic research (García Marcos, 1989, 2020) […] The city of Motril brings together both scenarios. The nucleus of the population is located inland. It has an important seafaring neighbourhood two kilometres away, with very idiosyncratic sociological characteristics. In this way, it has been possible to incorporate the urban/seafaring social factor into the analysis, in order to obtain the following table of results.
Therefore, what had begun to emerge in Pedro Martínez is confirmed in Motril. To meet the requirements of sociolinguistic representativeness, 50 speakers would be needed, that is, a sample of 0.086% of its population, 34.4 times more than 0.0025%.

/tr/ in the City of Almería
The third observation point was located in the city of Almería, where two linguistic variables have been analysed, /tr/ and, again, /t͡ʃ/.
The articulation of the first of these variables, the consonant group /tr/, presented in the city of Almería another variation not detected by dialectology. Although the ALEA paid attention to the city, as its point Al-508, it did not perceive any variation in that sound sequence. However, the exploratory sociolinguistic research of García Marcos (1999) did find a tendency to affricate the group. In American Spanish it was a well-delimited phenomenon, especially in Chile. From Lenz's initial research to the present day there has been an evident continuity in these studies (Figueroa, 2008). In the Peninsula it was not a completely unknown phenomenon either. In fact, when examining the Chilean situation, A. Alonso (1925, 1933) insisted that there were multiple solutions for /tr/ within the Hispanic domain, including the Peninsula. Its peninsular extension, however, was located in the north of Spain, more specifically in the Basque Country, Navarra and La Rioja. The possibility that it existed beyond that area had not been openly and systematically considered. Southern variants such as those of Almería were, in principle, unexpected. As in Pedro Martínez, the spectrographic analysis confirmed the articulation of a variant close to the Chilean affricate solutions (García Marcos, 2019a). Once again, the two-variant equivalence set used at the time was maintained. On this occasion the social factors that conditioned the variation were, firstly, sex and, more distantly, cultural level and age. In general terms, there was considerable variation, present in practically all social groups. In any case, affrication predominates among women, especially those with an intermediate or low cultural level. Furthermore, it is predominant among speakers over 50 years old.
So, finally, the sex factor, the most stable of all the social conditioning factors, was incorporated into the analysis to complete the general information on the variation of /tr/.
ilr.ideasspread.org International Linguistics Research Vol. 3, No. 4; 2020
As in the two previous cases, it is empirically corroborated that the results become distorted at a sample point other than 0.0025%. On this occasion, it is situated at 83 speakers, which is equivalent to a ratio of 0.043% of the 189680 inhabitants of the city. This figure also remains unchanged across the whole sociolinguistic spectrum.

/t͡ʃ/ in Almería
As indicated, a second variable was included for Almería. García Marcos's exploratory research (1999) uncovered another variable that had not been reported in the dialectological literature on Almería either. The enormous variation of the phoneme /t͡ʃ/ recorded by the ALEA (1959) in Andalusia, already referred to in the analysis of Pedro Martínez (Granada), took the form in Almería of a fricative variant [ʃ]. García Marcos (1999) confirmed its presence in the sociolinguistic spectrum of the city, fundamentally conditioned by the factors of sex and age. Fricative variants were predominantly found among male speakers. To a lesser extent, it was a phenomenon with an increased presence among older speakers, although it was also found in the other age groups. Therefore, the main social conditioning came from the sex factor, which also made it possible to maintain the same contrasting criterion as in the previous variable. In this way, a second approach was made to the same sociolinguistic spectrum, Almería, this time controlling a new linguistic variable. The results of this new exploration are the following:
As can be seen, the ratio that keeps the results stable is around 0.043%, exactly the same as in the analysis of the previous Almería variable. This complete coincidence confirms that it is a value that affects the whole of sociolinguistic variation within the speech community and, therefore, reinforces the hypothesis on which this work is based.

Sociolinguistic Attitudes in the City of Barcelona
The last point of analysis has been located in Barcelona, where two further variables are examined, this time referring to sociolinguistic attitudes. Catalonia in general, and Barcelona in particular, have been the object of permanent sociolinguistic attention. Since Badia's pioneering and almost foundational study (1969) of the language of the people of Barcelona, sociolinguistics has not neglected a community with an obvious attraction for analysing the relationships that link societies and languages. Barcelona is a metropolis of 5.5 million inhabitants, 1.5 times smaller than New York, which has received migratory contingents practically since the beginning of the 20th century. In Catalonia, moreover, there has been centuries-long language contact, with the coexistence of Spanish and Catalan. This contact has gone through different stages, among which two recent ones stand out as very marked from the sociolinguistic point of view. The dictatorship of General Franco meant severe repression of Catalan, which was reduced to the most informal levels of the socio-functional spectrum. After the restoration of Spanish democracy, and following the achievement of political autonomy in 1979, Catalonia began a progressive language planning effort aimed at the normalisation of its vernacular.
This means that, at least in theory, the aim was to modify the diglossia inherited from the previous stage, in which Catalan had been reduced to the status of a B-language. Naturally, the direction and intensity of this transformation of the diglossia in force until 1979 have given rise to different and even opposing positions, so that a gradation of options has been established, sometimes at great social risk:
1. The complete suppression of the Spanish language from the functional repertoire of Catalonia.
2. The inversion of the diglossia, placing Catalan as an A-language and Spanish as a B-language.
3. The formal and socio-functional equalization of both languages, which would mean the effective implementation of the co-official status they enjoy.
4. The maintenance of the previous diglossia.
This has meant an obvious linguistic debate, very present in social life, within which reality has not always been in step with political provisions. García Marcos (2019b) found that the official regulations on compulsory signage in Catalan were not applied consistently, especially in the urban centres of the larger towns in the capital's metropolitan area. This was a significant indication of the sociolinguistic conflict referred to above.
4.6.1 Throughout this process it has been essential to know the attitudes of citizens towards the two main languages of the long-standing and dense multilingualism found in Catalonia, and in Barcelona in particular. For the experiment developed here, the studies by Huguet (2007) and Gracia (2012) on this issue have been taken as a reference. After collecting materials in September 2019, we first examined attitudes towards the languages involved in this contact and the hypothetical predominance of one of them. Empirical research showed that positions in real society are less polarized than the political debate suggests. Even so, the differences exist and are therefore relevant for a sociolinguistic analysis.
The first group covered the attitudes of the people of Barcelona towards Catalan and Spanish. To this end, three possible attitudes were considered for each language: A. favourable (+), B. rejection (-) or C. neutral (=). The last option applies when respondents either explicitly declared themselves neutral or declined to state a position either way. It should be made clear that these options were not offered as mutually exclusive. Respondents could register two attitudes, a negative and a positive one, or abstain in both cases. Thus, after tabulating the data, the following table of results was obtained. This time, the results remain stable at 0.00498%, which translates into 274 speakers. With only one speaker fewer, at 0.00496%, the first significant deviations already begin to appear. It could be debated whether they are sufficient to reject that sampling point. In any case, below that point the deviations are too evident.
Therefore, once again, a larger sample is required than that theoretically contemplated, although it is less than twice as large. This confirms that the ratio is more effective in large populations, demographically close to the empirical reference used by Labov, New York City.
4.6.2 The second group of attitudes focused on a question that has been much debated practically since the beginning of the new language policy launched in 1979: which language should predominate in communication. An analogous criterion of attitude assignment was maintained, so that speakers could choose one or several options from among the four proposed. The results of this second approach to the sociolinguistic attitudes of the Barcelona population also showed very heterogeneous profiles, without a clearly predominant trend, as shown in Table 13. As happened in Almería, the behaviour of this second sample was completely equivalent to that of the first exploration of the attitudes of the people of Barcelona. Once again, the relevant ratio is 0.00498%, which confirms it as the appropriate parameter for preparing a sociolinguistic sample of the city.

Conclusions. Towards the Revision of the Reliability Ratio of Sociolinguistic Sampling
The results of the analyses carried out in this work warn against using the ratio of 0.0025% as a methodological universal for the measurement of sociolinguistic variation. In all the cases examined, the results began to deviate above this parameter. The finding is, moreover, quite robust: it has held both for data on linguistic behaviour and for attitude data, it has remained stable across more than one variable in the same community (Almería, Barcelona) and, finally, it has applied both to a variable as a whole and to its social subspecifications. It is clear that the 0.0025% ratio introduced a very important methodological novelty. But it is equally evident that it is not a criterion that should be maintained as a universal.
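As a compact check on the four cases, the empirical ratios can be recomputed from the populations and minimum stable samples reported in the text. The helper name `empirical_ratio_percent` is introduced here for illustration, and Barcelona's population is taken as 5,500,000. Because the figures in the text appear to be truncated, the last digit of a recomputed value may differ slightly.

```python
LABOV_RATIO_PERCENT = 0.0025

# (population, minimum stable sample in speakers), as reported in the text.
cases = {
    "Pedro Martínez": (1134, 16),
    "Motril": (58020, 50),
    "Almería": (189680, 83),
    "Barcelona": (5_500_000, 274),
}


def empirical_ratio_percent(population, speakers):
    """Share of the population covered by the sample, as a percentage."""
    return speakers / population * 100


for name, (population, speakers) in cases.items():
    ratio = empirical_ratio_percent(population, speakers)
    multiple = ratio / LABOV_RATIO_PERCENT
    print(f"{name}: {ratio:.5f}% of the population, "
          f"{multiple:.1f} times the 0.0025% ratio")
```

The multiple shrinks steadily as the population grows, which is the pattern the conclusions describe.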
Naturally, more axes of observation could have been included, in order to observe linguistic conditioning and stylistic variation as well. However, they would not have modified the final results. The aim here was not to measure the correlations of sociolinguistic variation, but to address a preliminary question, namely the minimum number of speakers needed to guarantee the reliability of the sample.
At this point, it is necessary to ask whether sociolinguistics should use special statistical methods or, on the contrary, may resort to standardized methods. Obtaining representative samples has a long tradition in statistics. Moreover, sociolinguistics fits easily into the field of stratified probability sampling. As a general criterion, these samples are configured on the basis of four main factors: the size of the reference universe (the inhabitants of a speech community), the level of confidence, the admitted margin of error, and the internal diversity of the reference universe. This is usually done by means of a classical calculation, summarized in the following formula (Cochran, 1982: 104; Vivanco, 2005: 53):
\[ n = \frac{N z^{2} p (1 - p)}{e^{2} (N - 1) + z^{2} p (1 - p)} \]
where n = sample size; N = size of the population; e = margin of error (percentage expressed as a decimal); p = probability and z = z-score. This last parameter measures how many standard deviations a value lies from the mean.
In principle, the ratios obtained empirically in the cases handled here correspond to high confidence levels and small margins of error, and therefore perform well within general statistics. Table 13 provides relevant information in that direction. This has its limitations, however, because there is still a tendency for the relative sample volume to increase as the population of the reference communities declines. Moreover, the empirical ratios have shown the lowest levels of confidence in the smallest populations: 75% (Pedro Martínez) and 84% (Motril) compared to 93% (Almería) and 90% (Barcelona). The margin of error, in turn, decreases as the population grows: 15% in Pedro Martínez, 10% in Motril and Almería, 5% in Barcelona.
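The correspondence between these parameters and the empirical samples can be explored with a short sketch. It assumes the standard finite-population form of the sample-size calculation, a two-sided z-score, and p = 0.5 (maximum variability); none of these three choices is stated explicitly in the text, and the function names are introduced here for illustration.

```python
from statistics import NormalDist


def z_score(confidence):
    """Two-sided z-score for a confidence level given as a fraction."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def required_sample(N, confidence, e, p=0.5):
    """Finite-population sample size (assumed form, with p = 0.5 assumed)."""
    z = z_score(confidence)
    variability = z ** 2 * p * (1 - p)
    return N * variability / (e ** 2 * (N - 1) + variability)


# (population, confidence level, margin of error) as reported in the text.
cases = {
    "Pedro Martínez": (1134, 0.75, 0.15),
    "Motril": (58020, 0.84, 0.10),
    "Almería": (189680, 0.93, 0.10),
    "Barcelona": (5_500_000, 0.90, 0.05),
}

for name, (N, confidence, e) in cases.items():
    print(f"{name}: about {required_sample(N, confidence, e):.1f} speakers")
```

Under these assumptions the computed sizes fall close to, though not exactly on, the empirical minima of 16, 50, 83 and 274 speakers.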
It is certainly a paradoxical situation. In fact, Labov's ratio (1966) follows standardized statistical criteria, although only for New York. That year, 1966, the official US census registered 8,399,000 inhabitants in the city. Applying 0.0025% to that population yields a sample of 210 speakers. This figure coincides with what would have been obtained had standardized criteria been applied, with a 95% confidence level and a 7-8% margin of error. Labov's empirical method, probably without his being aware of it, had brought him to the same place as general statistics.
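The New York figures can be checked in a few lines. The sketch assumes p = 0.5 and a two-sided 95% confidence level, and it ignores the finite-population correction, which is negligible for a population of this size; these are assumptions made here, not details given in the text.

```python
from math import sqrt
from statistics import NormalDist

population = 8_399_000          # New York City figure given in the text
ratio_percent = 0.0025          # Labov's ratio

speakers = population * ratio_percent / 100

# Margin of error implied by that sample at 95% confidence.
# Assumptions: p = 0.5, two-sided interval, no finite-population correction.
z = NormalDist().inv_cdf(0.975)
margin = z * sqrt(0.5 * (1 - 0.5) / speakers)

print(f"sample: {round(speakers)} speakers")
print(f"implied margin of error: {margin:.1%}")
```

Under these assumptions the implied margin of error comes out at a little under 7%, close to the lower end of the range quoted above.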