
Sampling design issues in Italian experience on scanner data and the possible integration with microdata coming from traditional data collection

Claudia De Vitiis

In collaboration with: C. Casciano, N. Cibella, A. Guandalini, F. Inglese, G. Seri, M. Terribili, F. Tiero

ISTAT - ITALY

Workshop scanner data. Rome 1-2 October 2015

Summary

1. Aims of the presentation

2. The new general sampling design for CPI

3. The context of the first experiments of sampling from Scanner Data

4. Selection of elementary items from Scanner Data

5. First results

6. Open issues and conclusions


Scanner data are a big opportunity to improve CPI compilation, not only for data collection but also from the sampling perspective

This presentation focuses on a comparison among indices of elementary aggregates compiled using different sub-sets of series obtained through different “selection schemes”

Elementary Index Bias

Elementary Index Sampling Variance

First experiment on a small data set:

One province, some consumption segments (Italian COICOP6), permanent series

1. Aims of the presentation


The current sample strategy of the CPI survey (at territorial level)

Three purposive sampling stages:

– The first-stage units (PSUs) are the chief towns of provinces (established by law)

– The second-stage units are the outlets, purposively chosen in each PSU to be representative of consumer behaviour

– The most sold items of a fixed basket of products are observed in the sample outlets

The elementary indices are obtained at municipality level by unweighted geometric mean

The general index is calculated by subsequent aggregation of elementary indices, using weights at different levels based on population and national account data on consumer expenditure
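As a rough illustration of the two steps described above, the sketch below computes an elementary index as an unweighted geometric mean of price relatives and then aggregates elementary indices with weights. The weighted arithmetic mean used for the upper level, the function names and all numbers are assumptions made for illustration; the slides do not state the aggregation formula.

```python
import math

def elementary_index(price_relatives):
    """Unweighted geometric mean of price relatives, scaled to base 100."""
    logs = [math.log(r) for r in price_relatives]
    return 100.0 * math.exp(sum(logs) / len(logs))

def aggregate(indices, weights):
    """Weighted arithmetic mean of lower-level indices (assumed aggregation rule)."""
    total_weight = sum(weights[k] for k in indices)
    return sum(indices[k] * weights[k] for k in indices) / total_weight

# Hypothetical price relatives (current price / base price) for two aggregates.
elementary = {
    "coffee": elementary_index([1.02, 1.00, 1.05]),
    "pasta": elementary_index([0.98, 1.01]),
}
# Hypothetical expenditure weights.
weights = {"coffee": 0.4, "pasta": 0.6}
general = aggregate(elementary, weights)
print(round(general, 2))
```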



A working group established at ISTAT is developing a probability sample strategy

Based on the hypothesis that turnover is a good proxy of final household consumption (expenditure)

Outlets and items are selected using probabilities proportional to the turnover

The selection list for the outlets (local units) is obtained from the business register (ASIA-UL, ASIA-PV)

The archive contains information useful for the selection and for the definition of the inclusion probabilities

For the item level, different approaches will coexist at the beginning:

Scanner data for food and grocery in modern distribution allow the use of sampling methods and index compilation using weights from quantities (or expenditures)

For traditional distribution and the other sectors, data collection and index compilation will continue unchanged at first



Among the first analyses on the scanner data universe we carried out some tentative experiments of sample selection of series

The data refer to the first Italian provinces for which ISTAT obtained data, for the year 2014

Weekly data on turnover and quantities per EAN-code (GTIN) and outlet allow obtaining weekly unit values

A series is considered present in a given month if it has non-zero turnover in at least one of the three central weeks of the month

The first issue we analysed is the continuity of series (series = EAN by outlet) and the coverage of a panel of permanent series

The following figures show examples of the coverage rate, over the months of 2014, of a fixed basket of series taken in December 2013
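The presence rule and the two coverage rates can be sketched as follows. This is a minimal illustration, not the procedure actually used: the record layout, the week labels and the data are hypothetical, and the sales coverage is assumed here to be the share of the month's turnover accounted for by the surviving basket series.

```python
# Hypothetical weekly records: (series_id, month, week_of_month, turnover).
# A series is one EAN code in one outlet; week labels and values are illustrative.
RELEVANT_WEEKS = {1, 2, 3}  # assumed labels of the three relevant weeks

def present_series(records, month):
    """Series with non-zero turnover in at least one relevant week of `month`."""
    return {sid for sid, m, week, turnover in records
            if m == month and week in RELEVANT_WEEKS and turnover > 0}

def coverage(records, base_month, month):
    """Return (item coverage %, sales coverage %) of the base-month basket."""
    base = present_series(records, base_month)
    alive = base & present_series(records, month)
    month_total = alive_total = 0.0
    for sid, m, week, turnover in records:
        if m == month and week in RELEVANT_WEEKS:
            month_total += turnover
            if sid in alive:
                alive_total += turnover
    return 100.0 * len(alive) / len(base), 100.0 * alive_total / month_total

records = [
    ("a", "2013-12", 1, 10.0),
    ("b", "2013-12", 2, 5.0),
    ("c", "2013-12", 4, 7.0),  # outside the relevant weeks: not in the basket
    ("a", "2014-01", 1, 8.0),
    ("b", "2014-01", 3, 0.0),  # zero turnover: not present in January
    ("d", "2014-01", 2, 2.0),  # new series, not part of the fixed basket
]
item_cov, sales_cov = coverage(records, "2013-12", "2014-01")
print(item_cov, sales_cov)
```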

3. The context of the first experiments of sampling from Scanner Data


Figure 1a. Presence of series fixed in Dec2013 in 2014 months - Coverage of single series and total turnover (all products, Turin province)

[Line chart. x axis: months of 2014, Jan to Dec; y axis: 0 to 100 (%). Series: Coverage of items (%), Total sales coverage (%)]

Data labels as extracted, first series (presumably Coverage of items, %): 85.19, 83.02, 81.62, 79.32, 77.54, 76.23, 73.89, 71.94, 73.46, 73.58, 73.53, 74.22

Data labels as extracted, second series (presumably Total sales coverage, %): 93.32, 91.68, 89.36, 81.11, 85.41, 83.16, 82.41, 81.33, 81.77, 82.75, 82.29, 80.49


Figure 1b. Presence of series fixed in Dec2013 in 2014 months - Coverage of single series and total turnover (Coicop 6 digits - Coffee segment, Turin province)

[Line chart. x axis: months of 2014, Jan to Dec; y axis: 0 to 100 (%). Series: Coverage of items (%), Total sales coverage (%)]

Data labels as extracted, first series (presumably Coverage of items, %): 91.56, 90.23, 89.14, 87.78, 87.12, 86.38, 85.05, 83.09, 85.02, 84.50, 85.26, 86.10

Data labels as extracted, second series (presumably Total sales coverage, %): 94.63, 95.97, 92.89, 96.02, 90.81, 92.58, 90.73, 92.00, 88.14, 89.07, 85.32, 91.75


Very first exercise on permanent series

The permanent series are defined as having non-zero turnover for at least one relevant week (one of the first three full weeks) every month for 13 months (Dec 2013-Dec 2014)

Having verified that the permanent series provide good coverage of the total turnover, for these first analyses we focus on this PANEL, postponing the issues related to the item life cycle, the replacement of missing items and seasonality

Our reference universe for the first experiments is constituted only of permanent series and indices are evaluated on this sub-set of series for each consumption segment
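The definition of the panel can be sketched as an intersection over the 13 months. This is only an illustration under assumptions: the `presence_by_month` mapping (month to the set of series present in that month) and the series identifiers are hypothetical.

```python
def permanent_series(presence_by_month, months):
    """Series present in every month of `months`."""
    sets = [presence_by_month.get(m, set()) for m in months]
    return set.intersection(*sets) if sets else set()

# The 13 months from December 2013 to December 2014.
months = ["2013-12"] + [f"2014-{m:02d}" for m in range(1, 13)]

# Hypothetical presence sets: "b" is missing in June, "c" appears only in March.
presence = {m: {"a", "b"} for m in months}
presence["2014-03"] = {"a", "b", "c"}
presence["2014-06"] = {"a"}

panel = permanent_series(presence, months)
print(sorted(panel))
```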



Table 1. Total turnover for all series, relevant week series and panel series, five Italian provinces (2014)


Province    Turnover,        Turnover, relevant      Turnover, relevant        B/A (%)   C/B (%)   Number of
            all series (A)   weeks, all series (B)   weeks, panel series (C)                       panel series
Turin       81,250,067       56,074,338              46,175,718                69.0      82.4      40,234
Ancona      22,847,504       15,988,337              12,487,957                70.0      78.1      16,516
Cagliari    18,308,833       12,726,186              9,598,138                 69.5      75.4      9,165
Palermo     18,374,304       12,711,236              8,531,003                 69.2      67.1      9,375
Piacenza    11,139,258       7,771,635               6,727,649                 69.8      86.6      6,737
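The two coverage columns of Table 1 are ratios of the turnover columns, B/A and C/B. The short check below recomputes them from the published turnover figures; the recomputed values may differ from the published percentages by about 0.1 point, presumably because of rounding. The dictionary layout and function name are illustrative.

```python
# Turnover figures copied from Table 1: (A) all series,
# (B) relevant weeks, all series, (C) relevant weeks, panel series.
table1 = {
    "Turin": (81_250_067, 56_074_338, 46_175_718),
    "Ancona": (22_847_504, 15_988_337, 12_487_957),
    "Cagliari": (18_308_833, 12_726_186, 9_598_138),
    "Palermo": (18_374_304, 12_711_236, 8_531_003),
    "Piacenza": (11_139_258, 7_771_635, 6_727_649),
}
# Coverage percentages as published in Table 1: (B/A, C/B).
published = {
    "Turin": (69.0, 82.4),
    "Ancona": (70.0, 78.1),
    "Cagliari": (69.5, 75.4),
    "Palermo": (69.2, 67.1),
    "Piacenza": (69.8, 86.6),
}

def coverage_rates(a, b, c):
    """Return (B/A, C/B) as percentages rounded to one decimal."""
    return round(100.0 * b / a, 1), round(100.0 * c / b, 1)

rates = {prov: coverage_rates(*vals) for prov, vals in table1.items()}
for prov, (b_over_a, c_over_b) in rates.items():
    print(f"{prov}: B/A = {b_over_a}%, C/B = {c_over_b}%")
```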

For the outlets of Turin for which we have data, we focus on three consumption segments (Coicop 6 digits):

Coffee (01.2.1.1.0) Pasta (01.1.1.6.1) Mineral water (01.2.2.1.0)


Table 2. Total turnover for all series, relevant week series and panel series, 3 segments in Turin (2014)


Consumption     Turnover,        Turnover, relevant      Turnover, relevant        B/A (%)   C/B (%)   Number of
segment         all series (A)   weeks, all series (B)   weeks, panel series (C)                       panel series
Coffee          28,622,978       19,665,517              15,692,414                68.7      79.8      9,608
Pasta           26,192,517       17,902,061              13,631,744                68.4      76.2      23,636
Mineral water   26,434,572       18,506,760              16,851,559                70.0      91.1      6,990

Comparison of probabilistic and non-probabilistic selection schemes under different aggregation formulas for elementary aggregates

Cut-off selection of series based on thresholds of covered turnover: the index is compiled using all series covering 60% or 80% of all turnover in each outlet for the consumption segment, in previous year 2013

Probability sampling: pps (size= previous year turnover) for two sampling rates (5% and 10%), selection of 500 samples

Reference method: most sold items in each outlet for representative products (current fixed basket approach)
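The cut-off and pps schemes listed above can be sketched as follows. This is a minimal illustration under assumptions: the slides do not say which pps algorithm was used, so systematic pps selection on a randomly ordered list (with certainty units removed first) is swapped in here, and all identifiers and turnover values are hypothetical.

```python
import random

def cutoff_selection(turnover, threshold):
    """Top-selling series of one outlet and segment, taken in decreasing
    order of turnover until their cumulative share reaches `threshold`."""
    total = sum(turnover.values())
    selected, cumulated = [], 0.0
    ranked = sorted(turnover.items(), key=lambda kv: kv[1], reverse=True)
    for series_id, value in ranked:
        if cumulated >= threshold * total:
            break
        selected.append(series_id)
        cumulated += value
    return selected

def pps_systematic(sizes, n, rng):
    """Draw n distinct units with probability proportional to size."""
    remaining = dict(sizes)
    certain = []
    # Units whose inclusion probability n * size / total would reach 1
    # are taken with certainty and removed before the systematic draw.
    while len(certain) < n and remaining:
        k = n - len(certain)
        total = sum(remaining.values())
        big = [u for u, s in remaining.items() if k * s >= total]
        if not big:
            break
        for u in big:
            certain.append(u)
            del remaining[u]
    k = n - len(certain)
    if k <= 0 or not remaining:
        return certain
    units = list(remaining)
    rng.shuffle(units)
    step = sum(remaining.values()) / k
    start = rng.random() * step
    picks, cumulated, j = [], 0.0, 0
    for u in units:
        cumulated += remaining[u]
        # Select the unit whose cumulated-size interval contains the next point.
        while j < k and start + j * step < cumulated:
            picks.append(u)
            j += 1
    return certain + picks

# Hypothetical previous-year turnover of four series in one outlet.
turnover_2013 = {"a": 50.0, "b": 25.0, "c": 15.0, "d": 10.0}
cut60 = cutoff_selection(turnover_2013, 0.60)
cut80 = cutoff_selection(turnover_2013, 0.80)

# 500 replicated pps samples at a 10% sampling rate from a toy universe.
rng = random.Random(2014)
universe = {f"s{i}": float(i) for i in range(1, 41)}
n = round(0.10 * len(universe))
samples = [pps_systematic(universe, n, rng) for _ in range(500)]
```

Repeating the draw 500 times, as in the experiment, gives an empirical distribution of any index computed on each sample.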

4. The selection of elementary items from SD



Table 3. Percentage and average number of items per outlet covering 60 and 80% of turnover, 3 segments in Turin (2014)

Consumption     % of series,     % of series,     Avg. items per    Avg. items per outlet    Avg. items per outlet
segment         60% threshold    80% threshold    outlet, total     covering 60% turnover    covering 80% turnover
Coffee          16.2             36.1             46                8                        17
Pasta           23.4             44.8             114               27                       51
Mineral water   12.1             26.3             34                4                        9

The sample series are selected from a sample of outlets: 30 out of the 127 outlets of modern retail distribution in Turin province

Outlets are selected by stratified simple random sampling (SRS), with allocation proportional to the turnover of the strata (6 chains by 2 types of outlet)

For each sample we compiled the elementary fixed base indices for 12 months with three classical aggregation formulas: Jevons (unweighted), Fisher (ideal) and Lowe (weights from quantities of previous year)

For the sample selection and weighting of indices we refer to total annual turnover

Comparison of each estimate with the corresponding universe value, evaluated on the complete set of panel series
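The three aggregation formulas can be sketched in their textbook form. This is an illustration only: prices stand for unit values, the dictionaries keyed by series and the toy numbers are hypothetical, and the exact estimators used on the samples are not given in the slides.

```python
import math

def jevons(p0, pt):
    """Unweighted geometric mean of price relatives (base = 100)."""
    logs = [math.log(pt[i] / p0[i]) for i in p0]
    return 100.0 * math.exp(sum(logs) / len(logs))

def lowe(p0, pt, qb):
    """Fixed-basket index with quantities qb of a weight reference period
    (here: the previous year)."""
    num = sum(pt[i] * qb[i] for i in p0)
    den = sum(p0[i] * qb[i] for i in p0)
    return 100.0 * num / den

def fisher(p0, pt, q0, qt):
    """Geometric mean of the Laspeyres and Paasche indices."""
    laspeyres = sum(pt[i] * q0[i] for i in p0) / sum(p0[i] * q0[i] for i in p0)
    paasche = sum(pt[i] * qt[i] for i in p0) / sum(p0[i] * qt[i] for i in p0)
    return 100.0 * math.sqrt(laspeyres * paasche)

# Toy data for two series: base-period and current-period prices and quantities.
p0 = {"a": 1.0, "b": 2.0}
pt = {"a": 1.1, "b": 2.0}
q0 = {"a": 10.0, "b": 5.0}
qt = {"a": 8.0, "b": 6.0}

print(jevons(p0, pt), lowe(p0, pt, q0), fisher(p0, pt, q0, qt))
```

In this toy example the Lowe index uses the base-period quantities as weights, so it coincides with the Laspeyres component of the Fisher index.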



For each aggregation formula, comparison of the values obtained on the different subsets of series

Bias

Variance
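Bias and sampling variability can be estimated empirically from the replicated samples. The sketch below assumes, since the slides do not give the definitions, that the bias is the mean of the sample estimates minus the universe value, that the relative sampling error is the standard deviation of the estimates as a percentage of their mean, and that a confidence band is the mean plus or minus 1.96 standard deviations. The function name and the numbers are hypothetical.

```python
import statistics

def replicate_summary(estimates, universe_value, z=1.96):
    """Empirical bias, relative sampling error (%) and confidence bounds
    from index estimates computed on replicated samples."""
    mean = statistics.mean(estimates)
    sd = statistics.stdev(estimates)
    return {
        "bias": mean - universe_value,
        "relative_error_pct": 100.0 * sd / mean,
        "lower": mean - z * sd,
        "upper": mean + z * sd,
    }

# Hypothetical index estimates from five replicated samples for one month.
estimates = [99.0, 101.0, 100.0, 102.0, 98.0]
summary = replicate_summary(estimates, universe_value=100.5)
print(summary)
```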

5. First results


Figure 2a. Lowe Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee -Turin 2014)

The mean of the sample estimates overlaps perfectly with the “true” value U

Cut-off samples over-estimate but follow the trend

The most sold items under-estimate and alter the trend


[Line chart “Lowe - Coffee”. x axis: months from Dec to Dec (Dec 2013 to Dec 2014); y axis: index values from 90 to 115. Series: Lowe_U, Lowe_s80, Lowe_s60, Lowe_1, Lowe_s]

Figure 2b. Fisher Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee -Turin 2014)

The mean of the sample estimates almost overlaps with the “true” value U

Cut-off samples over-estimate but follow the trend

The most sold items under-estimate and alter the trend



[Line chart “Fisher - Coffee”. y axis: index values from 95 to 115. Series: Fish_U, Fish_s80, Fish_s60, Fish_1, Fish_s]

Figure 2c. Jevons Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee -Turin 2014)

The mean of the sample estimates strongly over-estimates the “true” value U and accentuates the trend

Cut-off samples over-estimate but roughly follow the trend

The most sold items substantially follow the trend and the levels



[Line chart “Jevons - Coffee”. y axis: index values from 95 to 115. Series: Jevo_U, Jevo_s80, Jevo_s60, Jevo_1, Jevo_s]

Figure 3. Lowe Indices for elementary aggregate : comparison of universe and different sub-sets (Coffee and Pasta -Turin 2014)

Pasta: the mean of the sample estimates and the cut-off values overlap with the “true” value U

The most sold items over-estimate and alter the trend

Cut-off values and best-selling items show opposite performance for the two products: can this be explained by the different numbers of items and turnover distributions?



[Two line charts, “Coffee” and “Pasta”. y axis: index values from 95 to 115. Series: Lowe_U, Lowe_s80, Lowe_s60, Lowe_1, Lowe_s]

Figure 4. Fisher and Jevons Indices for elementary aggregate : comparison of universe and different sub-sets (Pasta -Turin 2014)

The mean of the sample estimates almost overlaps with the “true” value U

Cut-off samples slightly under-estimate but follow the trend

The most sold items accentuate the trend



[Two line charts, “Fisher” and “Jevons”. y axis: index values from 95 to 115. Series shown in the legend: Fish_U, Fish_s80, Fish_s60, Fish_1, Fish_s]

The mean of the sample estimates strongly over-estimates the “true” value U and accentuates the trend

Cut-off samples over-estimate but roughly follow the trend

The most sold items substantially follow the trend and the levels

Figure 5a. Confidence band for Lowe Indices for elementary aggregate in comparison with universe values, pps sample (Coffee Turin 2014)



[Two line charts, “Lowe - Coffee pps f=5%” and “Lowe - Coffee pps f=10%”. y axis: index values from 95 to 115. Series in each panel: Real value, pps estimate (f=5% or f=10%), UB, LB]

Figure 5b. Confidence band for Jevons Indices for elementary aggregate of Coffee in comparison with universe values, pps sample (Turin 2014)



[Two line charts, “Jevons - Coffee pps 5%” and “Jevons - Coffee pps 10%”. y axis: index values from 95 to 115. Series in each panel: Real value, pps estimate (f=5% or f=10%), UB, LB]

Table 4. Bias and Relative Sampling Error distribution of monthly Lowe, Fisher and Jevons indices for pps samples of series

Bias

Consumption   Sampling   Lowe    Lowe    Fisher   Fisher   Jevons   Jevons
segment       rate       min     max     min      max      min      max
Coffee        5%         -0.07   0.12    -0.45    0.10     1.87     5.88
Coffee        10%        -0.02   0.04    -0.17    0.06     0.73     3.51
Pasta         5%         -0.13   0.03    -0.25    0.23     -2.26    0.03
Pasta         10%        -0.05   0.03    -0.06    0.08     -2.43    0.12

Sampling relative error (%)

Consumption   Sampling   Sample   Lowe    Lowe    Fisher   Fisher   Jevons   Jevons
segment       rate       size     min     max     min      max      min      max
Coffee        5%         190      1.17    1.36    4.73     5.65     0.90     1.20
Coffee        10%        380      0.65    0.91    2.27     2.79     0.43     0.58
Pasta         5%         450      1.03    1.29    3.99     4.87     0.91     1.19
Pasta         10%        900      0.65    0.93    2.81     3.60     0.51     0.70


The results provide the following first evidence on the performance of the different series selection schemes

Cut-off samples are much less biased than a selection of the most sold items: in general, cut-off samples slightly over-estimate inflation, while the most sold items approach under-estimates it (even reversing the sign, inflation versus deflation); in any case, the results depend on the product category

A pps probability sample produces approximately unbiased estimates for the weighted indices (Lowe and Fisher), though the Fisher index shows a very high variance

An srs probability sample produces approximately unbiased estimates when using unweighted indices (Jevons)

The sampling scheme is not neutral with respect to the choice of index

Increasing the sampling rate produces a remarkable reduction of the bias in all indices, in addition to the obvious reduction of the sampling error

First replications on other product segments show similar evidence, depending however on the distribution of turnover


The sample allocation criteria are still under study, both for outlets and for items; an analysis of variability will be carried out

In-depth studies should take into consideration the life cycle of items and all related implications (new items, replacements…)

A big issue to deal with is the integration between scanner data and traditional data for index compilation; different hypotheses are under evaluation:

Combining indices obtained with different approaches

Gradually abandoning manual collection, at least for food and grocery, considering the high expenditure coverage of modern distribution

Aiming to define and implement probability sampling also for traditional distribution

6. Open issues and conclusions


Thank you for your attention!

