IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING
IEEJ Trans 2013; 8: 591–598
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI:10.1002/tee.21901

Paper

Identifying the Distribution Difference between Two Populations of Fuzzy Data Based on a Nonparametric Statistical Method

Pei-Chun Lin∗a, Non-member

Junzo Watada∗, Member

Berlin Wu∗∗, Non-member

Nonparametric statistical tests are distribution-free methods that do not assume that data are drawn from a particular probability distribution. In this paper, to identify the distribution difference between two populations of fuzzy data, we derive a function that can describe continuous fuzzy data. In particular, the Kolmogorov–Smirnov two-sample test is used for distinguishing two populations of fuzzy data. Empirical studies illustrate that the Kolmogorov–Smirnov two-sample test enables us to judge whether two independent samples of continuous fuzzy data are derived from the same population. The results show that the proposed function is successful in distinguishing two populations of continuous fuzzy data and useful in various applications. © 2013 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.

Keywords: fuzzy numbers, membership functions, Kolmogorov–Smirnov two-sample test, goodness-of-fit test, empirical distribution function, fuzzy statistics and data analysis

Received 18 April 2012; Revised 19 November 2012

1. Introduction and Literature Review

The two-sample test is one of the most useful nonparametric methods for comparing two samples because it is sensitive to differences in location and/or shape between two empirical cumulative distribution functions. Other statistical tests may also be useful [1], such as the median test, the Mann-Whitney test, and the parametric t test. The Kolmogorov–Smirnov two-sample test (hereafter, K–S two-sample test) is a goodness-of-fit test used to determine whether the two underlying distributions of the samples differ. In this paper, we concentrate on the K–S two-sample test because no statistical method can distinguish two populations of continuous fuzzy data based on their respective distribution functions. Hence, we use the K–S two-sample test to decide whether two independent samples of continuous fuzzy data are derived from the same population. We define a sample of continuous fuzzy data as a dataset obtained from a single population. Given two different samples of continuous fuzzy data, our goal is to test whether they have been drawn from the same population. This method is useful in various applications, such as industry, engineering, and social surveys.

Although many papers have discussed the powerful K–S two-sample test (see discussion in Refs [2–5]), these studies have virtually always simulated the test under known distributions. However, sometimes vague information is obtained, for example, when data are given in natural language. When using fuzzy data, the underlying distribution is not known. Moreover, it is not easy to put such information in statistical terms. Therefore, we must establish techniques to handle this information.

a Correspondence to: Pei-Chun Lin. E-mail: [email protected]

* The Graduate School of Information, Production and Systems, Waseda University, 2-7 Hibikino, Wakamatsu-ku, Kitakyushu, Fukuoka 808-0135

** Department of Mathematical Sciences, National Chengchi University, No. 64, Sec. 2, ZhiNan Rd., Wenshan District, Taipei City 11605, Taiwan (R.O.C.)

To manipulate continuous fuzzy data using the K–S two-sample test, we first need to calculate the empirical distribution function of continuous fuzzy data. Some method is necessary to classify all continuous fuzzy data. Many research works have proposed various ranking methods to classify fuzzy data. For instance, Lee-Kwang and Lee [6] proposed a method that derives rankings by considering the overall possibility distributions of fuzzy numbers and provides users with a method for evaluation. Tseng and Klein [7] designed an algorithm to rank any number of fuzzy numbers. Ota et al. [8] developed a VAM to decide the total ordering of fuzzy numbers. Xu and Sasaki [9] proposed a vertex method to calculate the distance between Grey numbers. Lee and You [10] proposed a ranking method that generates possible ranking sequences of given fuzzy numbers. Kang et al. [11] developed a new fuzzy ranking model based on user preferences. Hung et al. [12] provided a novel accuracy function to evaluate interval-valued fuzzy information based on intuition. Moreover, Yager [13] proposed a method of ranking fuzzy numbers using a centroid index. Although various methods have been proposed to rank fuzzy numbers, all of these methods are based on the concept of a central point. Each of these methods thus ignores some information about continuous fuzzy data in the calculation. Recently, Cheng [14] used the distance between fuzzy numbers to find the largest distance among data points, defuzzifying the fuzzy data by using two parameters that are calculated from the data. We consider it more effective to analyze the original fuzzy data. We follow this concept and combine it with the concept of fuzzy statistics in this paper.

A number of research studies have focused on fuzzy statistical analysis, and their applications can be found in various fields. For example, Esogbue and Song [15] proposed a defuzzification method that is rigorously examined, and presented an application of the method to the power system stabilization problem. Chen and Klein [16] proposed an approach using defuzzification methods for the fuzzy MADM. In addition, Wu and Sun [17] presented a class of real-life situations in which fuzzy techniques can be naturally reformulated in statistical terms. Moreover, Watada et al. [18] built a fuzzy regression model based on fuzzy random data and subjected the data to some heuristics. These studies have addressed various problems using defuzzification techniques to choose the central points of fuzzy numbers. Recently, Wu and Chang [19] evaluated the mean and variance values of interval data based on central point and radius data, but they did not consider statistical tests. In applying the concepts of fuzzy data to statistical tests, Lin et al. [20] also defined a weight function in terms of central point and radius values, but they neither gave a rigorous proof of the weight function nor compared their method with conventional ones. Hence, in this paper we define a defuzzification formula in terms of central point and radius values and give a rigorous proof. We also present various empirical studies.

The rest of the paper is organized as follows. Section 2 gives the preliminaries. In Section 3, the main method is described. In Section 4, some empirical studies show that the fuzzy hypothesis testing is useful in soft computations of continuous fuzzy data in the context of social science research. Moreover, comparison results are given in Section 5. Finally, some conclusions and topics for further studies are given in Section 6.

2. Preliminary Preparation

In this section, we will introduce a conventional statistical test and some definitions we will use in the following sections.

2.1. K–S two-sample test The K–S two-sample test is designed to evaluate whether two independent samples have been drawn from the same population (or from populations with the same distribution).

To apply the K–S two-sample test, we determine the cumulative frequency distribution for each sample of observations by using the same intervals for both distributions. Then, for each interval, we subtract one step function from the other. The test focuses on the largest of these observed deviations.

Let $S_m(X)$ be the empirical distribution function for one sample of size $m$, that is, $S_m(X) = \frac{1}{m}\sum_{i=1}^{m} I_{X_i \le x}$, where $I_{X_i \le x}$ is the indicator function, equal to 1 if $X_i \le x$ and equal to 0 otherwise. Let $S_n(X)$ be the empirical distribution function for the other sample of size $n$, that is, $S_n(X) = \frac{1}{n}\sum_{i=1}^{n} I_{X_i \le x}$. Thus, the K–S two-sample test statistic is

$$D_{m,n} = \sup_X \, [S_m(X) - S_n(X)] \qquad (1)$$

for a one-tailed test, and it is

$$D_{m,n} = \sup_X \, |S_m(X) - S_n(X)| \qquad (2)$$

for a two-tailed test. Note that (2) uses the absolute value.

In each case, the sampling distribution of $D_{m,n}$ is known. The probabilities associated with observed values as large as the observed $D_{m,n}$ under the null hypothesis (i.e. the two samples have come from the same distribution) are given in Ref. [4]. In fact, there may be two sampling distributions, depending upon whether the test is one-tailed or two-tailed. Notice that for a one-tailed test, we observe $D_{m,n}$ in the predicted direction using (1), and for a two-tailed test, we observe the maximum absolute difference $D_{m,n}$ using (2), regardless of the direction. This is because, in the one-tailed test, $H_1$ indicates that the population values relating to one of the samples are stochastically larger than the population values relating to the other sample, whereas in the two-tailed test, $H_1$ simply indicates that the two samples are from different populations.
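The statistics in (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch for ordinary real-valued samples; the function names `ecdf` and `ks_two_sample` are assumptions introduced here, not part of the paper.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function x -> fraction of sample values that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_two_sample(xs, ys, two_tailed=True):
    """Largest deviation between the two empirical distribution functions.

    two_tailed=True follows (2); two_tailed=False follows (1), i.e. the
    signed deviation S_m - S_n in the predicted direction.
    """
    s_m, s_n = ecdf(xs), ecdf(ys)
    # Both step functions only change at observed values, so it is enough
    # to evaluate the difference at the pooled observations.
    points = sorted(set(xs) | set(ys))
    diffs = [s_m(p) - s_n(p) for p in points]
    return max(abs(d) for d in diffs) if two_tailed else max(diffs)


d_two = ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6])
d_one = ks_two_sample([3, 4, 5, 6], [1, 2, 3, 4], two_tailed=False)
```

Evaluating only at the pooled observations is a design choice that relies on both empirical distribution functions being right-continuous step functions.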

If both $m$ and $n$ are 25 or less, Appendix Table LI in Ref. [21] can be used as a reference to test the null hypothesis against a one-tailed alternative, and Appendix Table in Ref. [21] can be used as a reference to test the null hypothesis against a two-tailed alternative. These tables give the values of $D_{m,n}$ that are significant at various levels. The critical value of the test statistic can be looked up once $m$, $n$, and $mnD_{m,n}$ are known, together with whether the test is one-tailed or two-tailed.

When either $m$ or $n$ is larger than 25, Appendix Table LIII in Ref. [21] may be used for the K–S two-sample test. To use this table, determine the value of $D_{m,n}$ required for significance by using the following expression:

$$K(\alpha)\sqrt{\frac{m+n}{mn}} \qquad (3)$$

where $\alpha$ is the significance level, and the value of the coefficient $K(\alpha)$ can be found in Table LIII of Ref. [21].

In the following, we provide some definitions with respect to membership functions and fuzzy numbers.
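The large-sample rule in (3) can be sketched as follows. The coefficients in `K_ALPHA` are the commonly tabulated two-tailed values and are an assumption of this sketch; for actual decisions the table cited in the text should be consulted. The function name is hypothetical.

```python
import math

# Commonly tabulated two-tailed coefficients K(alpha); illustrative only,
# the table cited in the text remains the authority.
K_ALPHA = {0.10: 1.22, 0.05: 1.36, 0.01: 1.63}


def large_sample_critical_value(m, n, alpha=0.05):
    """Value that D_{m,n} must exceed for significance, following (3)."""
    return K_ALPHA[alpha] * math.sqrt((m + n) / (m * n))


critical = large_sample_critical_value(50, 50, alpha=0.05)
```

An observed $D_{m,n}$ larger than this value would lead to rejecting the null hypothesis at the chosen level.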

2.2. Definitions

2.2.1. Definition 1 Trapezoidal membership function One class of functions frequently used to represent linguistic terms is the class of trapezoidal membership functions $\mu(x; a, b, c, d)$, which are defined as follows:

$$\mu(x; a, b, c, d) = \begin{cases} 0, & x < a \text{ or } x > d \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & b \le x \le c \\ \dfrac{d - x}{d - c}, & c \le x \le d \end{cases}$$

where $A = [a, b, c, d]$ is called a trapezoidal fuzzy number [22].

Note that, when $b = c$, $A$ is called a triangular fuzzy number; when $a = b$ and $c = d$, $A$ is called an interval-valued number.

In this paper, we represent fuzzy data by a central point and a radius, using Definition 2.
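As a hedged illustration of Definition 1, the membership function can be written as a small Python function. The name `trapezoid_membership` is an assumption of this sketch, and the branch order is chosen so that the degenerate cases $a = b$ or $c = d$ never divide by zero.

```python
def trapezoid_membership(x, a, b, c, d):
    """Membership degree of x in the trapezoidal fuzzy number [a, b, c, d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        # Rising edge; this branch is only reached when a < b.
        return (x - a) / (b - a)
    if x <= c:
        # Plateau (a single point when b == c, i.e. a triangular number).
        return 1.0
    # Falling edge; this branch is only reached when c < d.
    return (d - x) / (d - c)


degrees = [trapezoid_membership(x, 0, 6, 9, 10) for x in (3, 7, 9.5, 11)]
```

With $b = c$ the same function evaluates a triangular fuzzy number, and with $a = b$ and $c = d$ it evaluates an interval-valued number.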

2.2.2. Definition 2 Moments and Center of Mass of a Planar Lamina Let $f$ and $g$ be continuous functions such that $f(x) \ge g(x)$ on $[a, b]$, and consider the planar lamina of uniform density $\rho$ bounded by the graphs of $y = f(x)$, $y = g(x)$, and $a \le x \le b$ [23].

(a) The moments about the $x$-axis and $y$-axis are

$$M_x = \rho \int_a^b \left[\frac{f(x) + g(x)}{2}\right] [f(x) - g(x)]\,dx \qquad (4)$$

$$M_y = \rho \int_a^b x\,[f(x) - g(x)]\,dx \qquad (5)$$

(b) The center of mass $(\bar{x}, \bar{y})$ is given by $\bar{x} = \frac{M_y}{m}$ and $\bar{y} = \frac{M_x}{m}$, where $m = \rho \int_a^b [f(x) - g(x)]\,dx$ is the mass of the lamina.

Note that, in mathematics, a planar lamina is a closed surface of mass $m$ and surface density $\rho$. It can be used to determine moments of inertia or the center of mass.

In the next section, we define a function to classify continuous fuzzy data. Moreover, we construct a procedure to use the K–S two-sample test for continuous fuzzy data.

3. Identifying the Distribution Difference between Two Populations of Fuzzy Data Based on a Nonparametric Statistical Method

To identify the distribution difference between two populations of fuzzy data, we propose a statistical pivot for the K–S two-sample test for two continuous fuzzy datasets. The first step is to define a function that can realize continuous fuzzy data.

Fig. 1. Trapezoidal fuzzy number $f(x)$ with central point $o$ and radius $l$ having the same area as $h(x)$ (vertical axis: membership function $u(x)$; horizontal axis: $x$, with $a$, $b$, $c$, $d$ and $o$ marked)

3.1. Realization of continuous fuzzy data To calculate the empirical distribution function for continuous fuzzy data, we must classify continuous fuzzy data. We first give some properties of the central point and radius of continuous fuzzy data and then define a function for continuous fuzzy data. Then we use it to derive a new classification.

Note that by Definition 2, we know that for a trapezoidal fuzzy number

$$o = \frac{M_y}{m} = \frac{\rho \int_a^d x\,[f(x) - g(x)]\,dx}{\rho \int_a^d [f(x) - g(x)]\,dx} = \frac{\int_a^d x f(x)\,dx}{\int_a^d f(x)\,dx}$$

where $g(x) = 0$. Note that $m = \int_a^d f(x)\,dx$ is the area between the membership function $f(x)$ and the $x$-axis.

When we know the value of $o$, we can define a new membership function based on the central point $o$. Its membership function is as follows:

$$h(x) = \begin{cases} 1, & o - l \le x \le o + l \\ 0, & \text{otherwise} \end{cases}$$

We say that the membership functions $f(x)$ and $h(x)$ enclose the same area with the $x$-axis (Fig. 1).

By the mean value theorem for definite integrals [24], we have $m = \int_a^d f(x)\,dx = \int_{o-l}^{o+l} h(x)\,dx = 2l$, i.e.

$$l = \frac{\int_a^d f(x)\,dx}{2}$$

Now, we give the central point and radius of continuous fuzzy data in the following properties.

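Property 1 Let $x = [a, b]$ be an interval value, then its membership function is

$$f(x) = \begin{cases} 1, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$

Moreover, $o = \dfrac{\int_a^b x\,dx}{\int_a^b dx} = \dfrac{a + b}{2}$ and $l = \dfrac{\int_a^b dx}{2} = \dfrac{b - a}{2}$.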

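Property 2 Let $x = [a, b, c]$ be a triangular fuzzy number, then its membership function is

$$f(x) = \begin{cases} 0, & x < a \text{ or } x > c \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ \dfrac{c - x}{c - b}, & b \le x \le c \end{cases}$$

Moreover,

$$o = \frac{\int_a^b x \cdot \frac{x - a}{b - a}\,dx + \int_b^c x \cdot \frac{c - x}{c - b}\,dx}{\int_a^b \frac{x - a}{b - a}\,dx + \int_b^c \frac{c - x}{c - b}\,dx} = \frac{\frac{1}{6}(c - a)(a + b + c)}{\frac{c - a}{2}} = \frac{a + b + c}{3}$$

and

$$l = \frac{\int_a^b \frac{x - a}{b - a}\,dx + \int_b^c \frac{c - x}{c - b}\,dx}{2} = \frac{\frac{c - a}{2}}{2} = \frac{c - a}{4}$$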

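Property 3 Let $x = [a, b, c, d]$ be a trapezoidal fuzzy number, then its membership function is

$$f(x) = \begin{cases} 0, & x < a \text{ or } x > d \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & b \le x \le c \\ \dfrac{d - x}{d - c}, & c \le x \le d \end{cases}$$

Moreover,

$$o = \frac{\int_a^b x \cdot \frac{x - a}{b - a}\,dx + \int_b^c x\,dx + \int_c^d x \cdot \frac{d - x}{d - c}\,dx}{\int_a^b \frac{x - a}{b - a}\,dx + \int_b^c dx + \int_c^d \frac{d - x}{d - c}\,dx} = \frac{\frac{1}{6}\left[(c + d)^2 - (a + b)^2 + (ab - cd)\right]}{\frac{1}{2}\left[(c + d) - (a + b)\right]} = \frac{(c + d)^2 - (a + b)^2 + ab - cd}{3\left[(c + d) - (a + b)\right]}$$

and

$$l = \frac{\int_a^b \frac{x - a}{b - a}\,dx + \int_b^c dx + \int_c^d \frac{d - x}{d - c}\,dx}{2} = \frac{\frac{1}{2}\left[(c + d) - (a + b)\right]}{2} = \frac{(c + d) - (a + b)}{4}$$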
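The closed forms in Properties 1–3 can be checked with a short sketch. It assumes that a triangular number $[a, b, c]$ is passed as $[a, b, b, c]$ and an interval $[a, b]$ as $[a, a, b, b]$, so that the trapezoidal formula of Property 3 covers all three cases; the function name is hypothetical.

```python
def central_point_and_radius(a, b, c, d):
    """Central point o and radius l of the trapezoidal fuzzy number [a, b, c, d].

    Assumes a < d so that the area under the membership function is nonzero.
    """
    width = (c + d) - (a + b)          # equals twice the enclosed area
    o = ((c + d) ** 2 - (a + b) ** 2 + a * b - c * d) / (3 * width)
    l = width / 4
    return o, l


o_trap, l_trap = central_point_and_radius(0, 6, 9, 10)   # trapezoid
o_tri, l_tri = central_point_and_radius(5, 8, 8, 16)     # triangle [5, 8, 16]
o_int, l_int = central_point_and_radius(2, 2, 6, 6)      # interval [2, 6]
```

For the triangle and the interval, the values agree with $(a + b + c)/3$, $(c - a)/4$ and with $(a + b)/2$, $(b - a)/2$, respectively.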

3.1.1. Definition 3 Realization of continuous fuzzy data Let $x \equiv (o; l)$ be a continuous fuzzy value on $U$, which is the universal set, $o = \frac{M_y}{m}$ be the central point, and $l = \frac{m}{2}$ be the radius of fuzzy data. $M_y$ is the moment about the $y$-axis and $m$ is the mass of the lamina in Definition 2. The realization of the continuous fuzzy value ($RF_x$) is defined as follows:

$$RF_x = o + [1 - e^{-l}] \qquad (6)$$

which is used to rank the fuzzy data.

It is straightforward to see that the function is a well-defined function because it satisfies the axioms for the order relations. It remains to prove the transitive law, that is, to prove that if $RF_{x_1} < RF_{x_2}$ and $RF_{x_2} < RF_{x_3}$, then $RF_{x_1} < RF_{x_3}$. We give a simple proof as follows:

Proof: Let $x_1$, $x_2$, and $x_3$ be continuous fuzzy data. We can calculate the central point and radius by Properties 1–3. Hence, these expressions result in $x_1 \equiv (o_1, l_1)$, $x_2 \equiv (o_2, l_2)$, and $x_3 \equiv (o_3, l_3)$.

Suppose that we have $RF_{x_1} < RF_{x_2}$ and $RF_{x_2} < RF_{x_3}$.

It means that $o_1 + [1 - e^{-l_1}] < o_2 + [1 - e^{-l_2}]$ and $o_2 + [1 - e^{-l_2}] < o_3 + [1 - e^{-l_3}]$.

Therefore, $o_1 - e^{-l_1} < o_2 - e^{-l_2}$ and $o_2 - e^{-l_2} < o_3 - e^{-l_3}$. Hence, we get $o_1 - e^{-l_1} < o_3 - e^{-l_3}$.

By adding 1 to both sides, we get $o_1 + [1 - e^{-l_1}] < o_3 + [1 - e^{-l_3}]$. That is, $RF_{x_1} < RF_{x_3}$. This completes the proof.

3.1.2. Explanation of Definition 3 In order to defuzzify the continuous fuzzy data, we define a function $RF_x$, which is composed of the central point $o$ and radius $l$. The concept of (6) comes from the mass of a planar lamina (see Definition 2). It is a function that can calculate the weight of the area in space. In our case, we use the function to calculate the weight of the area between the membership function and the y-axis. In fact, the central point holds most of the weight of fuzzy data $x$. If we do not consider the radius $l$ in our function $RF_x$, the function will be defined by only one parameter (central point).

In this paper, to introduce more information about the fuzzy data, we have added the other parameter (radius). The radius contributes additional weight to the function $RF_x$. Therefore, we added the increasing function $1 - e^{-l}$. This function is built from the exponential function $e^{-l}$, which can extend the distance between two data. Hence, we write a function $RF_x$ whose value is calculated as a weight from the central point and radius. Moreover, we rank the fuzzy data by the weight function $RF_x$.
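A minimal sketch of (6) shows how the radius term behaves: it is zero for $l = 0$, increases with $l$, and stays below 1, so two fuzzy data with the same central point are separated only by their radii. The function name `realization` is an assumption of this sketch.

```python
import math


def realization(o, l):
    """RF_x = o + [1 - exp(-l)] for central point o and radius l >= 0."""
    return o + (1.0 - math.exp(-l))


# Same central point, different radii (cf. X7 and X6 in Section 4).
rf_small_radius = realization(6.5, 1.0)
rf_large_radius = realization(6.5, 1.5)
```

Because the added term lies in $[0, 1)$, the central point still carries most of the weight, which matches the explanation above.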

Now, we define a ranking criterion as follows:

3.1.3. Definition 4 Ranking criterion If $x_1$, $x_2$ are fuzzy data, we define the following ranking criterion.

(a) $RF_{x_1} < RF_{x_2}$ if and only if $x_1 \prec x_2$.

(b) $RF_{x_1} = RF_{x_2}$ if and only if $x_1 \approx x_2$.

(c) $RF_{x_1} > RF_{x_2}$ if and only if $x_1 \succ x_2$.

3.1.4. Definition 5 We say that, if $x_1 \approx x_2$, then $x_1$ and $x_2$ are in the same class. Moreover, if $x_1 \prec x_2$ or $x_1 \succ x_2$, then they are in different classes.

For example, let $x_1 = [5, 8, 16]$ and $x_2 = [4, 10, 15]$. We can calculate the central point and radius by Property 2. Hence we have $x_1 \equiv (9.67, 2.75)$ and $x_2 \equiv (9.67, 2.75)$. Moreover, $RF_{x_1} = RF_{x_2}$. We say that $x_1 \approx x_2$ and they are in the same class.

For another example, let $x_1 = [5, 8, 16]$ and $x_2 = [5, 7, 20]$. We can calculate the central point and radius by Property 2. Hence we have $x_1 \equiv (9.67, 2.75)$ and $x_2 \equiv (10.67, 3.75)$. Moreover, $RF_{x_1} < RF_{x_2}$. We say that $x_1 \prec x_2$ and they are in different classes.
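The two examples above can be reproduced with a short sketch of the ranking criterion for triangular fuzzy numbers. The function names and the returned labels are assumptions of this sketch; equality is tested with a tolerance because the values are floating-point numbers.

```python
import math


def rf_triangular(a, b, c):
    """RF value of a triangular fuzzy number [a, b, c] via Property 2 and (6)."""
    o = (a + b + c) / 3
    l = (c - a) / 4
    return o + (1.0 - math.exp(-l))


def compare(x1, x2):
    """Apply the ranking criterion of Definition 4 to two triangular numbers."""
    r1, r2 = rf_triangular(*x1), rf_triangular(*x2)
    if math.isclose(r1, r2):
        return "same class"        # x1 is equivalent to x2
    return "precedes" if r1 < r2 else "follows"


first = compare((5, 8, 16), (4, 10, 15))
second = compare((5, 8, 16), (5, 7, 20))
```

Here `first` reports the same class and `second` reports that $x_1$ precedes $x_2$, in line with the two examples.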

3.1.5. Definition 6 Empirical distribution function with continuous fuzzy value Let $x_1, x_2, \ldots, x_n$ be $n$ continuous fuzzy data. We can use the function $RF_{x_i}$ to rank the fuzzy data $x_i$ and separate them into different classes $c_i$, which are called Glivenko–Cantelli classes (see discussion in Refs [25–27]).

Therefore, we have the order statistic of $RF_{x_i}$ denoted by

$$RF_{x_{(1)}} < RF_{x_{(2)}} < \cdots < RF_{x_{(n)}} \qquad (7)$$

Hence, the empirical distribution function can be generalized to a set $C$ to obtain an empirical measure indexed by $c_i$.

$$S_n(c_i) = \frac{1}{n} \sum_{i=1}^{n} I_{c_i}(RF_{x_i}), \quad c_i \in C \qquad (8)$$

where $I_{c_i}$ is the indicator function denoted by

$$I_{c_i}(RF_{x_i}) = \begin{cases} 1, & RF_{x_i} \in c_i, \\ 0, & RF_{x_i} \notin c_i \end{cases}, \quad \forall i = 1, 2, \ldots, n \qquad (9)$$
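A possible way to turn RF values into classes and a distribution function over those classes is sketched below. It assumes that the empirical measure is read cumulatively over the ordered classes, as in the tables of Section 4, and that values equal after rounding to `ndigits` places belong to the same class; both the function name and the rounding rule are assumptions of this sketch.

```python
def class_cumulative_distribution(rf_values, ndigits=9):
    """Return the ordered classes and the cumulative proportion up to each class."""
    rounded = [round(v, ndigits) for v in rf_values]
    classes = sorted(set(rounded))      # one class per distinct RF value
    n = len(rounded)
    cumulative = []
    count = 0
    for c in classes:
        count += rounded.count(c)       # members of this class
        cumulative.append(count / n)
    return classes, cumulative


classes, cumulative = class_cumulative_distribution([7.45, 6.99, 7.45, 7.13])
```

Tied RF values fall into one class, so the number of classes can be smaller than the number of observations.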

With these definitions, we can now turn to distinguishing two populations of fuzzy data based on a nonparametric statistical method by proposing the K–S two-sample test for continuous fuzzy data.

3.2. Identifying the distribution difference between two populations of fuzzy data based on a nonparametric statistical method In this section, we introduce a nonparametric statistical method for continuous fuzzy data, namely, the K–S two-sample test for continuous fuzzy data. The K–S two-sample test is used to decide whether two independent samples have been drawn from the same population. The test focuses on the agreement between two cumulative distributions. However, how should the K–S two-sample test be adapted for continuous fuzzy data? To address this question, we have developed a new method to derive empirical distribution functions for the continuous fuzzy data in order to find the statistical pivot of the K–S two-sample test. The procedure of the K–S two-sample test for continuous fuzzy data is as follows:

3.2.1. Samples Let $X_m$ and $Y_n$ be two samples with continuous fuzzy data. $X_m$ has size $m$ and $Y_n$ has size $n$. Combining all observations results in $N = m + n$ data points. The values of the function $RF$ allow us to distribute the observations of $X_m$ and $Y_n$ into classes $C_i$; several observations may fall into the same class. The number of classes is less than or equal to $N$. Moreover, the two empirical distribution functions of $X_m$ and $Y_n$ can be derived separately.

3.2.2. Hypothesis The null hypothesis $H_0$ is that the two samples have the same distribution, i.e.

$$H_0 : F_1(x) = F_2(x)$$

and

$$H_1 : F_1(x) \ne F_2(x)$$

where $F_1(x)$ denotes the distribution function of one fuzzy sample of size $m$, and $F_2(x)$ denotes the distribution function of the other fuzzy sample of size $n$.

3.2.3. Statistics $D_{m,n} = \sup_X |S_m(X) - S_n(X)|$, where $S_m(X)$ is the observed cumulative distribution for one sample of size $m$, and $S_n(X)$ is the observed cumulative distribution for the other sample of size $n$.

3.2.4. Decision rule The significance level $\alpha$ is stipulated, and Appendix Table LII in Ref. [21] is used for small samples and Appendix Table LIII in Ref. [21] is used for large samples for making a decision.

Similar to the K–S two-sample test with real numbers, the K–S two-sample test with continuous fuzzy data also focuses on the agreement between two cumulative distributions. If the two fuzzy samples have indeed been drawn from the same population, then the cumulative distributions of the fuzzy samples should be close to each other. If the two fuzzy sample cumulative distributions are too far apart in any interval, this implies that the fuzzy samples come from different populations. Therefore, if the two fuzzy sample cumulative distributions have a large deviation, then $H_0$ is rejected.

4. Empirical Studies

Example 1: A Japanese mobile telecommunications company wants to investigate how many times college students send e-mails by mobile phone per day. A manager decides to make a fuzzy questionnaire. A sample of 20 customers (10 males and 10 females) who study at Waseda University in Kitakyushu was randomly selected. The investigator asked them to fill in the following questionnaire: (i) You send e-mail ____ times (interval) by mobile phone per day. (ii) You send e-mail more than ____ times (real numbers) by mobile phone per day. (iii) You send e-mail less than ____ times (real numbers) by mobile phone per day. We collected those data and obtained trapezoidal fuzzy numbers. The answers are shown in Table 1.

Table I. Number of e-mails per day by male and female students

| Group | Responses |
|---|---|
| Males | [0,6,9,10], [6,6,9,10], [3,6,9,10], [5,6,8,8], [5,5,8,10], [5,5,8,8], [5,6,7,8], [5,6,8,8], [5,7,16,16], [4,6,12,15] |
| Females | [4,5,7,7], [5,5,7,10], [5,5,10,10], [15,15,25,30], [5,5,7,7], [5,7,8,8], [4,4,9,15], [5,5,7,8], [5,6,15,20], [5,7,15,20] |

First, we calculated $o_i$ and $l_i$, and then we found the values of $RF$ and compared them. Moreover, we determined to which class they belong. These calculations are shown in Table 2.

Table II. Values for $o_i$, $l_i$, $RF$, and $C_i$

| | $[a_i, b_i, c_i, d_i]$ | $o_i$ | $l_i$ | $RF$ | $C_i$ |
|---|---|---|---|---|---|
| X1 | [0,6,9,10] | 6.03 | 3.25 | $6.03 + [1 - e^{-3.25}]$ | 5 |
| X2 | [6,6,9,10] | 7.76 | 1.75 | $7.76 + [1 - e^{-1.75}]$ | 13 |
| X3 | [3,6,9,10] | 6.93 | 2.50 | $6.93 + [1 - e^{-2.50}]$ | 10 |
| X4 | [5,6,8,8] | 6.73 | 1.25 | $6.73 + [1 - e^{-1.25}]$ | 7 |
| X5 | [5,5,8,10] | 7.04 | 2.00 | $7.04 + [1 - e^{-2.00}]$ | 11 |
| X6 | [5,5,8,8] | 6.50 | 1.50 | $6.50 + [1 - e^{-1.50}]$ | 3 |
| X7 | [5,6,7,8] | 6.50 | 1.00 | $6.50 + [1 - e^{-1.00}]$ | 6 |
| X8 | [5,6,8,8] | 6.73 | 1.25 | $6.73 + [1 - e^{-1.25}]$ | 7 |
| X9 | [5,7,16,16] | 10.98 | 5.00 | $10.98 + [1 - e^{-5.00}]$ | 16 |
| X10 | [4,6,12,15] | 9.27 | 4.25 | $9.27 + [1 - e^{-4.25}]$ | 15 |
| Y1 | [4,5,7,7] | 5.73 | 1.25 | $5.73 + [1 - e^{-1.25}]$ | 1 |
| Y2 | [5,5,7,10] | 6.86 | 1.75 | $6.86 + [1 - e^{-1.75}]$ | 9 |
| Y3 | [5,5,10,10] | 7.50 | 2.50 | $7.50 + [1 - e^{-2.50}]$ | 12 |
| Y4 | [15,15,25,30] | 21.33 | 6.25 | $21.33 + [1 - e^{-6.25}]$ | 19 |
| Y5 | [5,5,7,7] | 6.00 | 1.00 | $6.00 + [1 - e^{-1.00}]$ | 2 |
| Y6 | [5,7,8,8] | 6.92 | 1.00 | $6.92 + [1 - e^{-1.00}]$ | 8 |
| Y7 | [4,4,9,15] | 8.19 | 4.00 | $8.19 + [1 - e^{-4.00}]$ | 14 |
| Y8 | [5,5,7,8] | 6.27 | 1.25 | $6.27 + [1 - e^{-1.25}]$ | 4 |
| Y9 | [5,6,15,20] | 11.58 | 6.00 | $11.58 + [1 - e^{-6.00}]$ | 17 |
| Y10 | [5,7,15,20] | 11.83 | 5.75 | $11.83 + [1 - e^{-5.75}]$ | 18 |

After comparing the values of $RF$, we obtained the following inequality: $RF_{Y_1} < RF_{Y_5} < RF_{X_6} < RF_{Y_8} < RF_{X_1} < RF_{X_7} < RF_{X_4} = RF_{X_8} < RF_{Y_6} < RF_{Y_2} < RF_{X_3} < RF_{X_5} < RF_{Y_3} < RF_{X_2} < RF_{Y_7} < RF_{X_{10}} < RF_{X_9} < RF_{Y_9} < RF_{Y_{10}} < RF_{Y_4}$

From the above, we derived 19 classes. Then, we found the cumulative distributions of $X_i$ and $Y_i$.

From Table 3, the following test statistic was obtained:

$$D = \sup |S_{10}(X) - S_{10}(Y)| = 0.3$$

at a significance level of $\alpha = 0.05$, $mnD = 10 \times 10 \times (0.3) = 30 < 70$ (see Appendix Table LII in Ref. [21]). Because the observed value did not exceed the critical value, we could not reject $H_0$. We conclude that male and female students report the same interval for the number of times they send e-mail by mobile phone per day.
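The calculation in Example 1 can be sketched end to end as follows. The data are those of Table I and the critical value 70 is the one quoted in the text; the function and variable names are assumptions of this sketch, and the RF values are computed at full floating-point precision rather than from the rounded table entries.

```python
import math

MALES = [(0, 6, 9, 10), (6, 6, 9, 10), (3, 6, 9, 10), (5, 6, 8, 8), (5, 5, 8, 10),
         (5, 5, 8, 8), (5, 6, 7, 8), (5, 6, 8, 8), (5, 7, 16, 16), (4, 6, 12, 15)]
FEMALES = [(4, 5, 7, 7), (5, 5, 7, 10), (5, 5, 10, 10), (15, 15, 25, 30), (5, 5, 7, 7),
           (5, 7, 8, 8), (4, 4, 9, 15), (5, 5, 7, 8), (5, 6, 15, 20), (5, 7, 15, 20)]


def rf(a, b, c, d):
    """RF value of a trapezoidal fuzzy number via Property 3 and (6)."""
    width = (c + d) - (a + b)
    o = ((c + d) ** 2 - (a + b) ** 2 + a * b - c * d) / (3 * width)
    return o + 1.0 - math.exp(-width / 4)


def ks_statistic(xs, ys):
    """Two-tailed D and the number of classes (distinct pooled values)."""
    points = sorted(set(xs) | set(ys))
    s_x = [sum(v <= p for v in xs) / len(xs) for p in points]
    s_y = [sum(v <= p for v in ys) / len(ys) for p in points]
    return max(abs(u - v) for u, v in zip(s_x, s_y)), len(points)


rf_m = [rf(*t) for t in MALES]
rf_f = [rf(*t) for t in FEMALES]
d_stat, n_classes = ks_statistic(rf_m, rf_f)
mn_d = len(rf_m) * len(rf_f) * d_stat
reject_h0 = mn_d > 70          # critical value quoted in the text
print(round(d_stat, 2), n_classes, reject_h0)
```

The identical responses X4 and X8 share one class, which is why 20 observations yield 19 classes.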

Example 2: With the same problem as in Example 1, we use the same data in Table 1. We used the weight function ($WF$), which was proposed by Lin et al. in Ref. [20], to defuzzify the fuzzy data. We denote this method as the WF method. The weight function is defined as follows:

$$W_i \equiv W(o_i, l_i) = o_i\,[1 + k e^{-2 l_i}], \quad \forall i = 1, 2, 3, \ldots,$$

where $o_i$ is the central point, $l_i$ is the radius with respect to $o_i$, and $k = \max_i (o_i + l_i) - \min_j (o_j - l_j)$, $\forall i, j = 1, 2, 3, \ldots$.

We determine to which class they belong. The results are shown in Table 4.

Table IV. Values for $o_i$, $l_i$, $WF$, and $C_i$

| | $[a_i, b_i, c_i, d_i]$ | $o_i$ | $l_i$ | $WF$ | $C_i$ |
|---|---|---|---|---|---|
| X1 | [0,6,9,10] | 6.03 | 3.25 | $6.03[1 + ke^{-6.50}]$ | 1 |
| X2 | [6,6,9,10] | 7.76 | 1.75 | $7.76[1 + ke^{-3.50}]$ | 11 |
| X3 | [3,6,9,10] | 6.93 | 2.50 | $6.93[1 + ke^{-5.00}]$ | 2 |
| X4 | [5,6,8,8] | 6.73 | 1.25 | $6.73[1 + ke^{-2.50}]$ | 15 |
| X5 | [5,5,8,10] | 7.04 | 2.00 | $7.04[1 + ke^{-4.00}]$ | 6 |
| X6 | [5,5,8,8] | 6.50 | 1.50 | $6.50[1 + ke^{-3.00}]$ | 12 |
| X7 | [5,6,7,8] | 6.50 | 1.00 | $6.50[1 + ke^{-2.00}]$ | 18 |
| X8 | [5,6,8,8] | 6.73 | 1.25 | $6.73[1 + ke^{-2.50}]$ | 15 |
| X9 | [5,7,16,16] | 10.98 | 5.00 | $10.98[1 + ke^{-10.00}]$ | 7 |
| X10 | [4,6,12,15] | 9.27 | 4.25 | $9.27[1 + ke^{-8.50}]$ | 5 |
| Y1 | [4,5,7,7] | 5.73 | 1.25 | $5.73[1 + ke^{-2.50}]$ | 13 |
| Y2 | [5,5,7,10] | 6.86 | 1.75 | $6.86[1 + ke^{-3.50}]$ | 10 |
| Y3 | [5,5,10,10] | 7.50 | 2.50 | $7.50[1 + ke^{-5.00}]$ | 4 |
| Y4 | [15,15,25,30] | 21.33 | 6.25 | $21.33[1 + ke^{-12.50}]$ | 16 |
| Y5 | [5,5,7,7] | 6.00 | 1.00 | $6.00[1 + ke^{-2.00}]$ | 17 |
| Y6 | [5,7,8,8] | 6.92 | 1.00 | $6.92[1 + ke^{-2.00}]$ | 19 |
| Y7 | [4,4,9,15] | 8.19 | 4.00 | $8.19[1 + ke^{-8.00}]$ | 3 |
| Y8 | [5,5,7,8] | 6.27 | 1.25 | $6.27[1 + ke^{-2.50}]$ | 14 |
| Y9 | [5,6,15,20] | 11.58 | 6.00 | $11.58[1 + ke^{-12.00}]$ | 8 |
| Y10 | [5,7,15,20] | 11.83 | 5.75 | $11.83[1 + ke^{-11.50}]$ | 9 |

After comparing the values of $WF$, we obtained the following inequality: $WF_{X_1} < WF_{X_3} < WF_{Y_7} < WF_{Y_3} < WF_{X_{10}} < WF_{X_5} < WF_{X_9} < WF_{Y_9} < WF_{Y_{10}} < WF_{Y_2} < WF_{X_2} < WF_{X_6} < WF_{Y_1} < WF_{Y_8} < WF_{X_4} = WF_{X_8} < WF_{Y_4} < WF_{Y_5} < WF_{X_7} < WF_{Y_6}$

From Table 5, the following test statistic was obtained:

$$D = \sup |S_{10}(X) - S_{10}(Y)| = 0.3$$

at a significance level of $\alpha = 0.05$, $mnD = 10 \times 10 \times (0.3) = 30 < 70$ (see Appendix Table LII in Ref. [21]). Because the observed value did not exceed the critical value, we could not reject $H_0$. We conclude that male and female students report the same interval for the number of times they send e-mail by mobile phone per day.
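The weight function of the WF method can be sketched in the same style. The sketch assumes that $k$ is computed over the pooled 20 observations of Table I; the function names are hypothetical.

```python
import math

DATA = [(0, 6, 9, 10), (6, 6, 9, 10), (3, 6, 9, 10), (5, 6, 8, 8), (5, 5, 8, 10),
        (5, 5, 8, 8), (5, 6, 7, 8), (5, 6, 8, 8), (5, 7, 16, 16), (4, 6, 12, 15),
        (4, 5, 7, 7), (5, 5, 7, 10), (5, 5, 10, 10), (15, 15, 25, 30), (5, 5, 7, 7),
        (5, 7, 8, 8), (4, 4, 9, 15), (5, 5, 7, 8), (5, 6, 15, 20), (5, 7, 15, 20)]


def central_point_and_radius(a, b, c, d):
    """Central point o and radius l of a trapezoidal fuzzy number (Property 3)."""
    width = (c + d) - (a + b)
    o = ((c + d) ** 2 - (a + b) ** 2 + a * b - c * d) / (3 * width)
    return o, width / 4


def weight_function_values(fuzzy_numbers):
    """Return the WF value of every fuzzy number together with k."""
    pairs = [central_point_and_radius(*t) for t in fuzzy_numbers]
    # k is the spread between the largest o + l and the smallest o - l.
    k = max(o + l for o, l in pairs) - min(o - l for o, l in pairs)
    return [o * (1 + k * math.exp(-2 * l)) for o, l in pairs], k


wf, k = weight_function_values(DATA)
wf_x5, wf_x7, wf_x8 = wf[4], wf[6], wf[7]
```

For X7, X8, and X5 the radii are 1.00, 1.25, and 2.00, and the WF values decrease in that order, which illustrates how strongly the term $k e^{-2l}$ reacts to the radius.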

Example 3: With the same problem as in Example 1, we use the same data in Table 1. We used the central point $o_i$ to defuzzify the fuzzy data. This is the conventional method. We denote this method as the C method. Moreover, we determine to which class they belong. The results are shown in Table 6.

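Table III. Cumulative distributions of $X_i$ and $Y_j$

| $C_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $S_{10}(X)$ | 0 | 0 | .1 | .1 | .2 | .3 | .5 | .5 | .5 | .6 | .7 | .7 | .8 | .8 | .9 | 1 | 1 | 1 | 1 |
| $S_{10}(Y)$ | .1 | .2 | .2 | .3 | .3 | .3 | .3 | .4 | .5 | .5 | .5 | .6 | .6 | .7 | .7 | .7 | .8 | .9 | 1 |
| $\lvert S_{10}(X) - S_{10}(Y) \rvert$ | .1 | .2 | .1 | .2 | .1 | 0 | .2 | .1 | 0 | .1 | .2 | .1 | .2 | .1 | .2 | .3 | .2 | .1 | 0 |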

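Table V. Cumulative distributions of $X_i$ and $Y_j$

| $C_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $S_{10}(X)$ | .1 | .2 | .2 | .2 | .3 | .4 | .5 | .5 | .5 | .5 | .6 | .7 | .7 | .7 | .9 | .9 | .9 | 1 | 1 |
| $S_{10}(Y)$ | 0 | 0 | .1 | .2 | .2 | .2 | .2 | .3 | .4 | .5 | .5 | .5 | .6 | .7 | .7 | .8 | .9 | .9 | 1 |
| $\lvert S_{10}(X) - S_{10}(Y) \rvert$ | .1 | .2 | .1 | 0 | .1 | .2 | .3 | .2 | .1 | 0 | .1 | .2 | .1 | 0 | .2 | .1 | 0 | .1 | 0 |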

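Table VI. Values for $C$ and $C_i$

| | $[a_i, b_i, c_i, d_i]$ | $o_i$ | $C$ | $C_i$ |
|---|---|---|---|---|
| X1 | [0,6,9,10] | 6.03 | 6.03 | 3 |
| X2 | [6,6,9,10] | 7.76 | 7.76 | 12 |
| X3 | [3,6,9,10] | 6.93 | 6.93 | 9 |
| X4 | [5,6,8,8] | 6.73 | 6.73 | 6 |
| X5 | [5,5,8,10] | 7.04 | 7.04 | 10 |
| X6 | [5,5,8,8] | 6.50 | 6.50 | 5 |
| X7 | [5,6,7,8] | 6.50 | 6.50 | 5 |
| X8 | [5,6,8,8] | 6.73 | 6.73 | 6 |
| X9 | [5,7,16,16] | 10.98 | 10.98 | 15 |
| X10 | [4,6,12,15] | 9.27 | 9.27 | 14 |
| Y1 | [4,5,7,7] | 5.73 | 5.73 | 1 |
| Y2 | [5,5,7,10] | 6.86 | 6.86 | 7 |
| Y3 | [5,5,10,10] | 7.50 | 7.50 | 11 |
| Y4 | [15,15,25,30] | 21.33 | 21.33 | 18 |
| Y5 | [5,5,7,7] | 6.00 | 6.00 | 2 |
| Y6 | [5,7,8,8] | 6.92 | 6.92 | 8 |
| Y7 | [4,4,9,15] | 8.19 | 8.19 | 13 |
| Y8 | [5,5,7,8] | 6.27 | 6.27 | 4 |
| Y9 | [5,6,15,20] | 11.58 | 11.58 | 16 |
| Y10 | [5,7,15,20] | 11.83 | 11.83 | 17 |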

After comparing the values of the central point $o_i$, we obtain the following inequality:

$C_{Y_1} < C_{Y_5} < C_{X_1} < C_{Y_8} < C_{X_6} = C_{X_7} < C_{X_4} = C_{X_8} < C_{Y_2} < C_{Y_6} < C_{X_3} < C_{X_5} < C_{Y_3} < C_{X_2} < C_{Y_7} < C_{X_{10}} < C_{X_9} < C_{Y_9} < C_{Y_{10}} < C_{Y_4}$

From Table 7, the following test statistic was obtained:

$$D = \sup |S_{10}(X) - S_{10}(Y)| = 0.3$$

at a significance level of $\alpha = 0.05$, $mnD = 10 \times 10 \times (0.3) = 30 < 70$ (see Appendix Table LII in Ref. [21]). Because the observed value did not exceed the critical value, we could not reject $H_0$. We conclude that male and female students report the same interval for the number of times they send e-mail by mobile phone per day.

Example 4: With the same problem as in Example 1, we use the same data in Table 1. We use the distance method developed by Cheng [14]. We denote this method as the RD method. We show the result in Table 8.

Note that we have the membership function $f(x)$ as in Property 3. Here, $\bar{x}_i = o_i$ and

$$\bar{y}_i = \frac{\int_0^1 y\,g_X^L\,dy + \int_0^1 y\,g_X^R\,dy}{\int_0^1 g_X^L\,dy + \int_0^1 g_X^R\,dy} = \frac{(a + d) + 2(b + c)}{3(a + b + c + d)},$$

where $g_X^L = a + (b - a)y$ and $g_X^R = d + (c - d)y$ are inverse functions $\forall y \in [0, 1]$. Moreover, $R(X_i) = \sqrt{\bar{x}_i^2 + \bar{y}_i^2}$.

Table VIII. Values for $\bar{x}_i$, $\bar{y}_i$, $RD$, and $C_i$

| | $[a_i, b_i, c_i, d_i]$ | $\bar{x}_i$ | $\bar{y}_i$ | $RD$ | $C_i$ |
|---|---|---|---|---|---|
| X1 | [0,6,9,10] | 6.03 | 0.53 | $\sqrt{(6.03)^2 + (0.53)^2}$ | 3 |
| X2 | [6,6,9,10] | 7.76 | 0.49 | $\sqrt{(7.76)^2 + (0.49)^2}$ | 12 |
| X3 | [3,6,9,10] | 6.93 | 0.51 | $\sqrt{(6.93)^2 + (0.51)^2}$ | 9 |
| X4 | [5,6,8,8] | 6.73 | 0.51 | $\sqrt{(6.73)^2 + (0.51)^2}$ | 6 |
| X5 | [5,5,8,10] | 7.04 | 0.49 | $\sqrt{(7.04)^2 + (0.49)^2}$ | 10 |
| X6 | [5,5,8,8] | 6.50 | 0.50 | $\sqrt{(6.50)^2 + (0.50)^2}$ | 5 |
| X7 | [5,6,7,8] | 6.50 | 0.50 | $\sqrt{(6.50)^2 + (0.50)^2}$ | 5 |
| X8 | [5,6,8,8] | 6.73 | 0.51 | $\sqrt{(6.73)^2 + (0.51)^2}$ | 6 |
| X9 | [5,7,16,16] | 10.98 | 0.51 | $\sqrt{(10.98)^2 + (0.51)^2}$ | 15 |
| X10 | [4,6,12,15] | 9.27 | 0.50 | $\sqrt{(9.27)^2 + (0.50)^2}$ | 14 |
| Y1 | [4,5,7,7] | 5.73 | 0.51 | $\sqrt{(5.73)^2 + (0.51)^2}$ | 1 |
| Y2 | [5,5,7,10] | 6.86 | 0.48 | $\sqrt{(6.86)^2 + (0.48)^2}$ | 7 |
| Y3 | [5,5,10,10] | 7.50 | 0.50 | $\sqrt{(7.50)^2 + (0.50)^2}$ | 11 |
| Y4 | [15,15,25,30] | 21.33 | 0.49 | $\sqrt{(21.33)^2 + (0.49)^2}$ | 18 |
| Y5 | [5,5,7,7] | 6.00 | 0.50 | $\sqrt{(6.00)^2 + (0.50)^2}$ | 2 |
| Y6 | [5,7,8,8] | 6.92 | 0.51 | $\sqrt{(6.92)^2 + (0.51)^2}$ | 8 |
| Y7 | [4,4,9,15] | 8.19 | 0.48 | $\sqrt{(8.19)^2 + (0.48)^2}$ | 13 |
| Y8 | [5,5,7,8] | 6.27 | 0.49 | $\sqrt{(6.27)^2 + (0.49)^2}$ | 4 |
| Y9 | [5,6,15,20] | 11.58 | 0.49 | $\sqrt{(11.58)^2 + (0.49)^2}$ | 16 |
| Y10 | [5,7,15,20] | 11.83 | 0.49 | $\sqrt{(11.83)^2 + (0.49)^2}$ | 17 |

After comparing the values of $R(X_i)$, we obtain the following inequality: $RD_{Y_1} < RD_{Y_5} < RD_{X_1} < RD_{Y_8} < RD_{X_6} = RD_{X_7} < RD_{X_4} = RD_{X_8} < RD_{Y_2} < RD_{Y_6} < RD_{X_3} < RD_{X_5} < RD_{Y_3} < RD_{X_2} < RD_{Y_7} < RD_{X_{10}} < RD_{X_9} < RD_{Y_9} < RD_{Y_{10}} < RD_{Y_4}$

From Table 9, the following test statistic was obtained:

$$D = \sup |S_{10}(X) - S_{10}(Y)| = 0.3$$

at a significance level of $\alpha = 0.05$, $mnD = 10 \times 10 \times (0.3) = 30 < 70$ (see Appendix Table LII in Ref. [21]). Because the observed value did not exceed the critical value, we could not reject $H_0$. We conclude that male and female students report the same interval for the number of times they send e-mail by mobile phone per day.
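The RD value used in Example 4 can be sketched from the closed forms given there. The function name `rd_value` is an assumption of this sketch; the horizontal coordinate reuses the central point of Property 3.

```python
import math


def rd_value(a, b, c, d):
    """R(X) = sqrt(x_bar^2 + y_bar^2) for the trapezoidal fuzzy number [a, b, c, d]."""
    width = (c + d) - (a + b)
    x_bar = ((c + d) ** 2 - (a + b) ** 2 + a * b - c * d) / (3 * width)
    y_bar = ((a + d) + 2 * (b + c)) / (3 * (a + b + c + d))
    return math.hypot(x_bar, y_bar)


rd_x1 = rd_value(0, 6, 9, 10)
rd_x6 = rd_value(5, 5, 8, 8)
rd_x7 = rd_value(5, 6, 7, 8)
```

X6 and X7 have the same central point and, for these data, the same vertical coordinate, so they receive the same RD value and fall into one class.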

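Table VII. Cumulative distributions of $X_i$ and $Y_j$

| $C_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $S_{10}(X)$ | 0 | 0 | .1 | .1 | .3 | .5 | .5 | .5 | .6 | .7 | .7 | .8 | .8 | .9 | 1 | 1 | 1 | 1 |
| $S_{10}(Y)$ | .1 | .2 | .2 | .3 | .3 | .3 | .4 | .5 | .5 | .5 | .6 | .6 | .7 | .7 | .7 | .8 | .9 | 1 |
| $\lvert S_{10}(X) - S_{10}(Y) \rvert$ | .1 | .2 | .1 | .2 | 0 | .2 | .1 | 0 | .1 | .2 | .1 | .2 | .1 | .2 | .3 | .2 | .1 | 0 |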


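Table IX. Cumulative distributions of $X_i$ and $Y_j$

| $C_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $S_{10}(X)$ | 0 | 0 | .1 | .1 | .3 | .5 | .5 | .5 | .6 | .7 | .7 | .8 | .8 | .9 | 1 | 1 | 1 | 1 |
| $S_{10}(Y)$ | .1 | .2 | .2 | .3 | .3 | .3 | .4 | .5 | .5 | .5 | .6 | .6 | .7 | .7 | .7 | .8 | .9 | 1 |
| $\lvert S_{10}(X) - S_{10}(Y) \rvert$ | .1 | .2 | .1 | .2 | 0 | .2 | .1 | 0 | .1 | .2 | .1 | .2 | .1 | .2 | .3 | .2 | .1 | 0 |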

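Table X. Values for $RF$, $C_i$ of $RF$, $WF$, $C_i$ of $WF$, $C$, $C_i$ of $C$, $RD$, and $C_i$ of $RD$

| | $RF$ | $C_i$ of $RF$ | $WF$ | $C_i$ of $WF$ | $C$ | $C_i$ of $C$ | $RD$ | $C_i$ of $RD$ |
|---|---|---|---|---|---|---|---|---|
| X1 | 6.99 | 5 | 6.25 | 1 | 6.03 | 3 | 6.05 | 3 |
| X2 | 8.59 | 13 | 13.57 | 11 | 7.76 | 12 | 7.78 | 12 |
| X3 | 7.85 | 10 | 8.09 | 2 | 6.93 | 9 | 6.95 | 9 |
| X4 | 7.44 | 7 | 20.43 | 15 | 6.73 | 6 | 6.75 | 6 |
| X5 | 7.90 | 11 | 10.24 | 6 | 7.04 | 10 | 7.06 | 10 |
| X6 | 7.28 | 3 | 14.53 | 12 | 6.50 | 5 | 6.52 | 5 |
| X7 | 7.13 | 6 | 28.32 | 18 | 6.50 | 5 | 6.52 | 5 |
| X8 | 7.44 | 7 | 20.43 | 15 | 6.73 | 6 | 6.75 | 6 |
| X9 | 11.97 | 16 | 10.99 | 7 | 10.98 | 15 | 10.99 | 15 |
| X10 | 10.26 | 15 | 9.32 | 5 | 9.27 | 14 | 9.28 | 14 |
| Y1 | 6.44 | 1 | 17.39 | 13 | 5.73 | 1 | 5.75 | 1 |
| Y2 | 7.69 | 9 | 12.00 | 10 | 6.86 | 7 | 6.88 | 7 |
| Y3 | 8.42 | 12 | 8.75 | 4 | 7.50 | 11 | 7.52 | 11 |
| Y4 | 22.33 | 19 | 21.33 | 16 | 21.33 | 18 | 21.34 | 18 |
| Y5 | 6.63 | 2 | 26.14 | 17 | 6.00 | 2 | 6.02 | 2 |
| Y6 | 7.55 | 8 | 30.15 | 19 | 6.92 | 8 | 6.94 | 8 |
| Y7 | 9.17 | 14 | 8.26 | 3 | 8.19 | 13 | 8.20 | 13 |
| Y8 | 6.98 | 4 | 19.03 | 14 | 6.27 | 4 | 6.29 | 4 |
| Y9 | 12.58 | 17 | 11.58 | 8 | 11.58 | 16 | 11.59 | 16 |
| Y10 | 12.83 | 18 | 11.83 | 9 | 11.83 | 17 | 11.84 | 17 |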

Table XI. Results of K–S two-sample test in each method

Method      Defuzzification formula       Classes  The statistic of K–S  Decision of K–S
RF Method   $o + [1 - e^{-l}]$            19       0.3                   Accept
WF Method   $o \times [1 + k e^{-2l}]$    19       0.3                   Accept
C Method    $o$                           18       0.3                   Accept
RD Method   $R(X_i)$                      18       0.3                   Accept

5. Comparison

To clarify the results of each method with the K–S two-sample test, we give the values of the four defuzzification formulas and the corresponding classes Ci in Table X. Moreover, we give the comparison results of each method in Table XI. Table XI shows that all methods yield the same K–S statistic and the same decision. The methods differ in their defuzzification formulas and in the number of classes: the RF and WF methods produce 19 classes, one more than the 18 classes of the C and RD methods. This means that the RF and WF methods are more effective in classifying the fuzzy data because they divide the fuzzy data into more subdivisions.
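The comparison summarized in Table XI can be checked directly from the defuzzified values in Table X. The following sketch is illustrative only: the helper name `ks_D` and the data layout are assumptions, the numbers are transcribed from Table X, and the number of classes is taken to be the number of distinct values in the pooled sample.

```python
from fractions import Fraction

# Defuzzified values transcribed from Table X: one (RF, WF, C, RD) tuple per observation.
X = [
    (6.99, 6.25, 6.03, 6.05), (8.59, 13.57, 7.76, 7.78),
    (7.85, 8.09, 6.93, 6.95), (7.44, 20.43, 6.73, 6.75),
    (7.90, 10.24, 7.04, 7.06), (7.28, 14.53, 6.50, 6.52),
    (7.13, 28.32, 6.50, 6.52), (7.44, 20.43, 6.73, 6.75),
    (11.97, 10.99, 10.98, 10.99), (10.26, 9.32, 9.27, 9.28),
]
Y = [
    (6.44, 17.39, 5.73, 5.75), (7.69, 12.00, 6.86, 6.88),
    (8.42, 8.75, 7.50, 7.52), (22.33, 21.33, 21.33, 21.34),
    (6.63, 26.14, 6.00, 6.02), (7.55, 30.15, 6.92, 6.94),
    (9.17, 8.26, 8.19, 8.20), (6.98, 19.03, 6.27, 6.29),
    (12.58, 11.58, 11.58, 11.59), (12.83, 11.83, 11.83, 11.84),
]

def ks_D(xs, ys):
    """Largest gap between the two empirical cumulative distributions."""
    m, n = len(xs), len(ys)
    return max(
        abs(Fraction(sum(v <= t for v in xs), m) - Fraction(sum(v <= t for v in ys), n))
        for t in set(xs) | set(ys)
    )

summary = {}
for k, name in enumerate(["RF", "WF", "C", "RD"]):
    xs = [row[k] for row in X]
    ys = [row[k] for row in Y]
    n_classes = len(set(xs) | set(ys))  # distinct pooled values (assumed definition)
    summary[name] = (n_classes, ks_D(xs, ys))

for name, (n_classes, D) in summary.items():
    print(name, n_classes, float(D))
```

Because the K–S statistic depends only on the ordering of the pooled values, it can be computed on the defuzzified values themselves without first converting them to class indices.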

The columns Ci of C and Ci of RD in Table X show that the C and RD methods assign the same class to every observation. Although Cheng (RD method) used two variables, Xi (the value on the horizontal axis) and Yi (the value on the vertical axis), to determine the RD value, the value Yi had no effect on the resulting class Ci. Hence, we consider that the RD method contributes nothing beyond the conventional method (C method).

From Table X, we can also see that the Ci of WF decreases where the other three classes (Ci of RF, Ci of C, and Ci of RD) all increase.

For example, consider the three trapezoidal fuzzy numbers X5, X7, and X8. The order of their classes under each method is as follows:

$RF_{X_7} < RF_{X_8} < RF_{X_5}$: $6 < 7 < 11$

$WF_{X_7} > WF_{X_8} > WF_{X_5}$: $18 > 15 > 6$

$C_{X_7} < C_{X_8} < C_{X_5}$: $5 < 6 < 10$

$RD_{X_7} < RD_{X_8} < RD_{X_5}$: $5 < 6 < 10$

This means that when we use two variables (central point oi and radius ri) in the WF method, the value of the weight function decreases as the value of ri increases. This does not satisfy the ranking criterion in Definition 4. Moreover, we cannot calculate the order statistic in Definition 6 by using this weight function.
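The opposite behavior of the two formulas can be read off their derivatives. The sketch below takes the formulas as listed in Table XI and assumes that the symbol l there plays the role of the second parameter that the text calls the radius ri, and that k > 0 and o > 0; these are assumptions for illustration, not statements from the paper.

```latex
\[
\frac{\partial}{\partial l}\Bigl( o + \bigl[ 1 - e^{-l} \bigr] \Bigr) = e^{-l} > 0,
\qquad
\frac{\partial}{\partial l}\Bigl( o \times \bigl[ 1 + k e^{-2l} \bigr] \Bigr) = -2 k\, o\, e^{-2l} < 0
\quad (k > 0,\ o > 0).
\]
```

Under these assumptions, for a fixed central point the RF value grows with l while the WF value shrinks, which matches the reversed ordering of X5, X7, and X8 under the WF method.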

Although the WF and RF methods provide the same results and both divide the fuzzy data into more subdivisions, we consider the WF method unsuitable for use with the K–S two-sample test here because its defuzzification formula does not satisfy Definitions 4 and 6. Hence, we conclude that our method (RF method) is more effective and successful with the K–S two-sample test.

6. Conclusions

In various research studies, many decisions, evaluations, and psychological tests are conducted using surveys and/or questionnaires to seek people's opinions. It is routine to ask people about their opinions according to binary, multiple-choice questions, but people actually have complex and/or vague thoughts. If we want to understand human thinking in reality, we must extend classical statistical models to address fuzzy numbers.

To identify the distribution difference between two populations of fuzzy data, we presented a new function RFx in this paper, which consists of both central points and radius values. The function RFx can classify all continuous fuzzy data. Moreover, we can differentiate two fuzzy samples. Hence, a cumulative distribution function can be obtained, which gives the statistical pivot of the K–S two-sample test with continuous fuzzy data. We also provided several empirical studies to compare the proposed method with conventional methods. The WF, RD, and RF methods all rank fuzzy data based not only on a single point but also on a second parameter. The difference between these three methods is that Cheng (RD method) chose the central point by calculating the moment about the x-value and the other parameter (y-value) by using an inverse function, whereas we (WF method and RF method) chose the central point by calculating the moment about the x-value and the other parameter (radius) by calculating the area between the membership function and the x-value.

In this paper, we provided a method to classify the continuous fuzzy data that enables us to use the K–S two-sample test to identify the distribution difference between two populations. Through this procedure, an intelligent calculation method can be applied to analyze industrial, physiologic, economic, or financial data in the future.


References

(1) Conover WJ. Practical Nonparametric Statistics. John Wiley and Sons: New York; 1971.

(2) Dixon WJ. Power under normality of several nonparametric tests. The Annals of Mathematical Statistics 1954; 25(3):610–614.

(3) Epstein B. Comparison of some non-parametric tests against normal alternatives with an application to life testing. Journal of the American Statistical Association 1955; 50(271):894–900.

(4) Massey F. The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association 1951; 46(253):68–78.

(5) Schroer G, Trenkler D. Exact and randomization distributions of Kolmogorov-Smirnov tests two or three samples. Computational Statistics and Data Analysis 1995; 20(2):185–202.

(6) Lee-Kwang H, Lee JH. A method for ranking fuzzy numbers and its application to decision-making. IEEE Transactions on Fuzzy Systems 1999; 7(6):677–685.

(7) Tseng TY, Klein CM. New algorithm for the ranking procedure in fuzzy decision making. IEEE Transactions on Systems, Man, and Cybernetics 1989; 19(5):1289–1296.

(8) Ota K, Taira N, Miyagi H. Group Decision-Making Model in Fuzzy AHP Based on the Variable Axis Method. IEEJ Transactions on Electronics, Information and Systems 2008; 128(2):303–309.

(9) Xu J, Sasaki M. Technique of order preference by similarity for multiple attribute decision making based on grey members. IEEJ Transactions on Electronics, Information and Systems 2004; 124(10):1999–2005.

(10) Lee J-H, You K-H. A fuzzy method for fuzzy numbers. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2003; E86-A(10):2650–2658.

(11) Kang B-Y, Kim D-W, Li Q. Fuzzy ranking model based on user preference. IEICE Transactions on Information and Systems 2006; E89-D(6):1971–1974.

(12) Hung K-C, Tsai Y-C, Lin K-P, Julian P. A novel measured function for MCDM problem based on interval-valued intuitionistic fuzzy sets. IEICE Transactions on Information and Systems 2010; E93-D(11):3056–3065.

(13) Yager RR. A procedure for ordering fuzzy subsets of the unit interval. Information Sciences 1981; 24(2):143–161.

(14) Cheng CH. A new approach for ranking fuzzy numbers by distance method. Fuzzy Sets and Systems 1998; 95(3):307–317.

(15) Esogbue AO, Song Q. On optimal defuzzification and learning algorithms: theory and applications. Fuzzy Optimization and Decision Making 2003; 2(4):283.

(16) Chen C-B, Klein CM. An efficient approach to solving fuzzy MADM problems. Fuzzy Sets and Systems 1997; 88(1):51–67.

(17) Wu B, Sun C. Interval-valued statistics, fuzzy logic, and their use in computational semantics. Journal of Intelligent and Fuzzy Systems 2001; 11(1–2):1–7.

(18) Watada J, Wang S, Pedrycz W. Building confidence-interval-based fuzzy random regression models. IEEE Transactions on Fuzzy Systems 2009; 17(6):1273–1283.

(19) Wu B, Chang SK. On testing hypothesis of fuzzy sample mean. Japan Journal of Industrial and Applied Mathematics 2007; 24(2):197–209.

(20) Lin P-C, Wu B, Watada J. Kolmogorov-Smirnov Two Sample Test with Continuous Fuzzy Data. Integrated Uncertainty Management and Applications, AISC 2010; 68:175–186.

(21) Siegel S, Castellan NJ. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. McGraw-Hill, New York; 1988.

(22) Klir GJ, Yuan B. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall PTR, New Jersey; 1995.

(23) Larson R, Hostetler R, Edwards BH. Essential Calculus: Early Transcendental Functions. Houghton Mifflin Company, Boston, New York; 2008.

(24) George Thomas B. Jr. Thomas' Calculus. 11th ed. Pearson Education, Inc.: Boston, San Francisco, New York; 2005.

(25) Gaenssler P, Stute W. Empirical processes: a survey of results for independent and identically distributed random variables. Annals of Probability 1979; 7(2):193–243.

(26) Gine E, Zinn J. Some limit theorems for empirical measures (with discussion). Annals of Probability 1984; 12(4):929–989.

(27) Serfling RJ. Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York; 1980.

Pei-Chun Lin (Non-member) received the B.Sc. degree from the Department of Mathematics, National Kaohsiung Normal University, Kaohsiung, Taiwan, and the M.Sc. degree from the Department of Mathematical Sciences, National Chengchi University, Taipei, Taiwan. She is currently working toward the Ph.D. degree in engineering at the Graduate School of Information, Production, and Systems, Waseda University, Fukuoka, Japan. Her research interests include soft computing, intelligent knowledge, fuzzy statistical analysis, and management engineering.

Junzo Watada (Member) (M'87) received the B.Sc. and M.Sc. degrees in electrical engineering from Osaka City University, and the Ph.D. degree from Osaka Prefecture University, Osaka, Japan. Currently, he is a Professor of management engineering, knowledge engineering, and soft computing at the Graduate School of Information, Production and Systems, Waseda University, Fukuoka, Japan. His research interests include soft computing, tracking system, knowledge engineering, and management engineering.

Berlin Wu (Non-member) received the Ph.D. degree in statistics from Indiana University, Bloomington, Indiana, USA, in 1988. He is currently a Professor with the Department of Mathematical Sciences, National Chengchi University, Taiwan. His current research interests include nonlinear time series analysis, fuzzy statistical analysis, and discovering various forecasting techniques with soft computing methods. Prof. Wu was a recipient of the Fulbright Scholarship in 1997 and of the Distinguished Research Professor Prize from the National Chengchi University in 2002 and 2004.

