Knowledge discovery in rubber extrusion processes

CASTEJÓN LIMAS M.; ORDIERES MERÉ J.B.(*); ALBA ELÍAS F.; MARTÍNEZ DE PISÓN ASCACIBAR F.J.

Department of Mechanical Engineering, University of La Rioja

c./Luis de Ulloa s/n, 26004 – Logroño (LA RIOJA) SPAIN

Abstract: - This paper describes the outcomes of a study that the EDMANS(**) group has recently performed on a rubber extrusion process, focusing on the knowledge discovery phase that precedes system modeling. Some of the tools developed to satisfy the special needs of such a process are also presented: the CiTree algorithm for clustering subpopulations in massive databases and the PAELLA algorithm for outlier detection and data cleaning in non-normal samples like those typically found in industrial processes. Finally, the results obtained by these data mining techniques when applied to a real rubber extrusion database are shown.

Key-Words: - CiTree, cluster analysis, PAELLA, outlier detection, rubber, extrusion

1 Introduction

1.1 An approach to the context

A modern approach [1] to the quality improvement of an industrial process must consider the entire set of productive processes as a single global system. This rule is generally accepted as a motto by factory managers. Cost reduction and quality improvement must be tackled along the whole chain of manufacturing processes in order to obtain measurable results in the final product. This is also true for rubber manufacturing, where the product composition or the particular techniques applied during the blending phase[2] must be carefully analyzed[3]--[7] to determine their eventual effects not only on the final product but also on the extrusion process itself. After several similar research projects (footnote 1), we now undertake the challenge of finding the features that the extrusion process has in common with those previous experiences.

The primary goals of this research are the prediction of the mechanical properties and geometrical features of the final product by means of the analysis of raw process data recorded during normal factory operation. Such prediction is highly relevant in the polymer industry: it should allow factory managers to predict product quality before[7] and after the product is manufactured, thus acting as a predictive software sensor[8]. Our research should eventually provide tools[10]--[15] that help to dynamically control the fundamental variables that define the global status of the production processes.

1 FEDER 2FD97-1575 research project on the extrusion shape geometry analysis, and the DPI2001-1408 research project, in which we began analyzing the quality of the mixtures by means of data mining techniques.

Strategically, the combination of artificial intelligence and modern control techniques should help the factory to enjoy the competitive advantages offered by reduced rubber wastage.

In this paper, we focus on the analysis of the database related to the extrusion process, as the manufacturer considers extrusion the key process of its productive system. This relevance is justified by the many processes that follow the extrusion (coating, flocking, sticking, marking, drilling, etc.) and by the large number of parameters that must be tuned in order to set up the extrusion process, which makes such adjustment a complex, critical and highly expertise-dependent task. In order to reduce the dependence on such prior knowledge, we will eventually develop a control system that allows the managers of the factory to gain a deeper understanding of the behavior of the processes and the consequences of their decisions.

Proceedings of the 8th WSEAS Int. Conference on Automatic Control, Modeling and Simulation, Prague, Czech Republic, March 12-14, 2006 (pp201-206)

1.2 The data set approach to quality improvement

Knowing a process from its recorded data yields advantages of great interest. The number of firms that seek and find solutions to their production problems by analyzing their production data is increasing every day. Current technology allows the control variables of special interest and the command history of the processes to be routinely stored in databases. A later analysis of these databases provides a potentially precious source of high-quality information that is of great help in decision making. That is precisely our approach.

We show here the first results obtained from the analysis of the data set from a rubber extrusion process. The data set consists of 251,144 measurements of 30 variables, which basically record the temperature, pressure and velocity at different points of the extrusion process. It contains missing data and a large number of irrelevant observations recorded during irregular maintenance stops.

2 Data Analysis

2.1 Stream preprocessing

The dataset contains the records from the running process, registered every minute during approximately nine months. During those months the observations were registered both during productive operation and during maintenance stops. The first preprocessing task is therefore to split the data set into two sets: one with the samples from maintenance stops and the other with the samples from running periods. In order to distinguish between these two preliminary clusters, we find it convenient to download the dataset from the database to the processing computer and to perform the splitting with algorithms more flexible than those available in the SQL language of the database. In particular, we are inclined to use the R environment (footnote 2) for our calculations.
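
The splitting described above can be sketched as follows. This is only an illustration written in Python rather than in the R environment used for the study, and the rule that separates running periods from stops is an assumption: the field name `screw_speed` and the threshold of 1.0 are hypothetical and do not come from the factory database. Records with a missing value are treated as belonging to a stop.

```python
from itertools import groupby


def split_into_sequences(records, is_running):
    """Split time-ordered records into contiguous running and stop blocks.

    Returns (running_blocks, stop_blocks); each block is a list of
    consecutive records that share the same state.
    """
    running_blocks, stop_blocks = [], []
    for running, block in groupby(records, key=is_running):
        (running_blocks if running else stop_blocks).append(list(block))
    return running_blocks, stop_blocks


def line_is_running(record):
    """Hypothetical rule: the line runs when the screw speed exceeds 1.0."""
    speed = record.get("screw_speed")
    # Missing measurements are assigned to the stop state.
    return speed is not None and speed > 1.0


if __name__ == "__main__":
    speeds = [0.0, 0.0, 12.1, 12.3, 12.2, 0.0, 11.8, 11.9]
    records = [{"minute": i, "screw_speed": s} for i, s in enumerate(speeds)]
    running, stops = split_into_sequences(records, line_is_running)
    print(len(running), "running sequences;", len(stops), "stop periods")
```

Grouping consecutive records, instead of merely filtering them, keeps each productive run as a separate block, which is the unit the later analysis works with.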

Figure 1 shows one of the sequences of normal productive operation once it has been isolated from the main dataset. We collected about 338 sequences of measurements taken during the productive processes.

2 The interested reader may find the software and documentation of the R language and environment at http://www.r-project.org

A second refinement of the data set is to distinguish the transient periods from permanent (steady-state) operation. This can also be conveniently performed in the R environment by defining simple rules.
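
The paper does not state which rules were used to separate transient from permanent operation. As one possible example of such a simple rule, the sketch below (in Python, not the R code of the study) marks a sample as steady when a monitored variable has stayed within a narrow band over a trailing window. The window length and the tolerance are hypothetical parameters chosen for illustration.

```python
def steady_state_mask(values, window=5, tolerance=0.5):
    """Flag samples whose trailing window varies by less than `tolerance`.

    A sample is marked steady when the last `window` values, including the
    sample itself, span a range smaller than `tolerance`. The first
    `window - 1` samples have no full window and are marked transient.
    """
    mask = []
    for i in range(len(values)):
        if i + 1 < window:
            mask.append(False)
            continue
        recent = values[i + 1 - window : i + 1]
        mask.append(max(recent) - min(recent) < tolerance)
    return mask


if __name__ == "__main__":
    # Hypothetical pressure trace: a start-up ramp followed by a plateau.
    pressure = [0, 20, 40, 60, 80, 100, 100.1, 99.9, 100.0, 100.2, 100.1]
    mask = steady_state_mask(pressure, window=3, tolerance=0.5)
    steady = [p for p, keep in zip(pressure, mask) if keep]
    print(steady)
```

In practice the same rule would be applied to several variables at once, keeping only the samples that are steady in all of them.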

Figure 1. Operation sequence block

Having refined the raw dataset to the point where the stable system states are isolated, we still have more than three hundred different sequences. The next step is to obtain the characteristics of each of these sequences in order to establish the number of different states present in the dataset.

2.2 Cluster Analysis

In order to characterize the nature of the complex elements we have isolated, we need a new algorithm such as the CiTree(***) hierarchical clustering algorithm. This algorithm (footnote 3) yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships amongst sample units.

3 We have already successfully applied this algorithm to a variety of industrial processes, as well as to data sets of a different nature originating in the fields of medicine, biology and epidemiology.

Unlike other agglomerative hierarchical clustering algorithms, CiTree is capable of coping with complex starting nodes. This enables the practitioner to analyze only the top few branches of the hierarchy, precisely those with information of greater value. Our starting nodes will be the representatives of the permanent operation sequences. We consider these sequences to be drawn from a collection of subpopulations, each following a multivariate normal distribution.

We first need to characterize the parameters of these subpopulations, which can easily be done by identifying the distribution of the sample mean vectors. Then, in order to build the hierarchy, we need to define the dissimilarity function that measures the differences among nodes. We rely on the likelihood ratio statistic for that purpose. Figure 2 shows the hierarchy of nested relationships identified by the CiTree algorithm.

Figure 2. CiTree Hierarchical cluster analysis
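
To make the idea of a likelihood ratio dissimilarity between normally distributed nodes concrete, the following Python sketch computes the classical statistic for testing whether two multivariate normal samples share a single mean and covariance: -2 log Λ = n log|S| - n1 log|S1| - n2 log|S2|, where S1, S2 and S are the maximum-likelihood covariances of the two nodes and of their union. This is a textbook form chosen for illustration; the exact dissimilarity used inside CiTree is described in [12] and may differ. The function names and the summary representation of a node (size, mean, covariance) are assumptions.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def merge_nodes(n1, mean1, cov1, n2, mean2, cov2):
    """Size, mean and ML covariance of the union of two summarised nodes."""
    n = n1 + n2
    p = len(mean1)
    mean = [(n1 * mean1[i] + n2 * mean2[i]) / n for i in range(p)]
    diff = [mean1[i] - mean2[i] for i in range(p)]
    # Pooled within-node covariance plus the between-node term.
    cov = [[(n1 * cov1[i][j] + n2 * cov2[i][j]) / n
            + (n1 * n2 / n ** 2) * diff[i] * diff[j]
            for j in range(p)] for i in range(p)]
    return n, mean, cov


def lr_dissimilarity(n1, mean1, cov1, n2, mean2, cov2):
    """-2 log(likelihood ratio): one Gaussian versus two Gaussians.

    Covariances are maximum-likelihood estimates (divided by n, not n - 1)
    and must be non-singular.
    """
    n, _, cov = merge_nodes(n1, mean1, cov1, n2, mean2, cov2)
    return (n * math.log(determinant(cov))
            - n1 * math.log(determinant(cov1))
            - n2 * math.log(determinant(cov2)))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(lr_dissimilarity(50, [0.0, 0.0], identity, 50, [0.0, 0.0], identity))
    print(lr_dissimilarity(50, [0.0, 0.0], identity, 50, [3.0, 0.0], identity))
```

Because the statistic needs only the size, mean and covariance of each node, two nodes can be compared and merged without revisiting the raw samples, which is what makes summarised starting nodes practical for large data sets.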

As the picture of the tree suggests, the practitioner may clearly see two main clusters, but four or even more are also possible, depending on the practitioner's subjective point of view. In order to determine the most appropriate configuration, some cluster validation must be performed.

2.3 Cluster Validation

In order to evaluate the quality of the different possible configurations that the CiTree hierarchy provides, we rely on the Fowlkes-Mallows [16] index. Figure 3 shows the evolution of the index through the top branches, analyzing the clustering results obtained by cutting the CiTree between two and ten clusters. These clustering results are compared with those provided by quadratic and linear discriminant analysis. The rationale behind these comparisons is that clusters that can be correctly learnt and reproduced by the discriminant methods are likely to have been more properly characterized.

Figure 3. Fowlkes-Mallows index

Q: Fowlkes-Mallows index of the Quadratic discriminant analysis.

L: Fowlkes-Mallows index of the Linear discriminant analysis.
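
For reference, the Fowlkes-Mallows index between two labelings of the same samples can be computed from pair counts, as in the Python sketch below. In the setting of Figure 3, one labeling would be the cut of the CiTree and the other the classes predicted by the discriminant analysis; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import sqrt


def pairs(count):
    """Number of unordered pairs that can be formed from `count` items."""
    return count * (count - 1) // 2


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two labelings of the same samples.

    The index is the geometric mean of the pairwise precision and recall:
    TP / sqrt((TP + FP) * (TP + FN)), where TP counts the pairs of samples
    placed together in both labelings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    together_in_both = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    together_in_a = sum(pairs(c) for c in Counter(labels_a).values())
    together_in_b = sum(pairs(c) for c in Counter(labels_b).values())
    if together_in_a == 0 or together_in_b == 0:
        return 0.0
    return together_in_both / sqrt(together_in_a * together_in_b)


if __name__ == "__main__":
    tree_cut = [0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 1, 1, 1, 1]
    print(fowlkes_mallows(tree_cut, predicted))
```

The index equals 1 when the two labelings define the same partition, regardless of how the clusters are named, and decreases as fewer pairs of samples are grouped consistently.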

From Figure 3, the practitioner may conclude that these top branches correctly capture the clusters that exist in the dataset. The results obtained by the quadratic discriminant analysis show no difference, or only a slight one, across the different cuts. In this case, following Occam's razor, we would be inclined to choose two clusters. With that choice we obtain the projection shown in Figure 4, where the LDA results are provided. As can be seen from that picture, a clear separation is obtained between the two classes.


Figure 4. LDA projection of the two main clusters.
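
A projection of this kind can be illustrated with the two-class Fisher discriminant, which projects the samples onto the direction w that solves S_w w = m1 - m0, where S_w is the within-class scatter matrix and m0, m1 are the class means. The Python sketch below is a minimal version under that assumption; it uses small made-up samples, is not the R code used to produce the figures, and covers only the two-class case.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equally sized samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of the deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x


def fisher_direction(class0, class1):
    """Direction that best separates two classes in Fisher's sense."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    within = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(s0, s1)]
    return solve(within, [b - a for a, b in zip(m0, m1)])


def project(rows, direction):
    """Scalar projection of each sample onto the discriminant direction."""
    return [sum(x * w for x, w in zip(row, direction)) for row in rows]


if __name__ == "__main__":
    cluster_a = [[0, 0], [1, 0], [0, 1], [1, 1]]
    cluster_b = [[4, 0], [5, 0], [4, 1], [5, 1]]
    w = fisher_direction(cluster_a, cluster_b)
    print(project(cluster_a, w), project(cluster_b, w))
```

When the projected values of the two clusters do not overlap, as in this toy example, the clusters are linearly separable, which is the situation Figure 4 depicts.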

Nevertheless, according to the LDA Fowlkes-Mallows index, the clustering results improve significantly at four clusters. Figure 5 shows the LDA projection of the samples at a four-cluster cut of the CiTree.

Figure 5. LDA projection of the main four clusters.

As can be seen, no significant improvement is gained by increasing the number of clusters. Thus we consider that there exist two principal behaviors within the registered dataset. In addition, Figure 5 reveals the existence of long tails in the dataset, with samples that may be considered outliers. Thus, we proceed by applying our PAELLA algorithm for outlier detection and data cleaning in non-normal samples.

2.4 Outlier Detection

In order to identify the outlying samples we run the PAELLA algorithm for 20,000 iterations. This identification not only allows us to clean the data set of outliers, but also helps to identify the possible causes of those strange behaviors. The identification took only about 4 minutes on a domestic computer (footnote 4). Figure 6 shows the results given by PAELLA. The outlying samples are marked with red crosses. Although they appear to be a large fraction in the picture, they represent less than one percent of the samples.

Figure 6. PAELLA algorithm results. Outliers marked with red crosses, regular points marked with blue dots.

It is worth noting the integration of both algorithms, CiTree and PAELLA: PAELLA uses the results of CiTree, and CiTree provides better performance than competing clustering algorithms.

4 The time was recorded on a regular laptop equipped with a 1 GHz AMD64 CPU and 1 GB of RAM.


Figure 7 shows the dataset once the outliers have been removed. With this cleaner data set, the results from even robust modeling tools will be more reliable, as they will reflect the behavior of the vast majority of the data, and the outlying samples will have no effect on the parameter estimates.

Figure 7. Dataset after being cleaned by the PAELLA algorithm

3 Conclusions

In this paper we have discussed a new approach to the exploration of a large data set from a rubber factory. In particular, we have shown how to apply hierarchical agglomerative clustering dendrograms when the data comes packed in sequences of common states. Our approach was to obtain the limits of these sequences and then to consider them as the result of a preclustering phase. Using these composite elements as 'atoms', we built a hierarchical agglomerative tree with a natural dissimilarity measure. This hierarchy revealed interesting structures. This clustering algorithm, named CiTree, is a hierarchical agglomerative algorithm in which the construction of the lower branches of the hierarchical tree is replaced by a basic and fast non-hierarchical algorithm. Indeed, in most cases the bottom branches of a hierarchical tree are not as useful as the top branches in providing sense and interpretation. Therefore it seems reasonable to short-circuit the early and most computationally heavy phase of dendrogram construction, sacrificing some less significant details; instead, complex computations should be concentrated on the later phase, where important information is more likely to be revealed.

The suggestions gleaned from visual interpretation of the dendrogram were confirmed by numerical analysis based on the Fowlkes-Mallows index, thus showing the power of the visual representation provided by the dendrogram. These results, useful in themselves, can also provide a deeper insight into the structure of the data if used in synergy with other data mining techniques.

Finally, we showed an example of outlier identification by means of the PAELLA algorithm, where the CiTree algorithm relieved the bottleneck previously found in non-normal outlier identification. The identification of the outliers provided a cleaner data set from which to proceed with the following phases of the modeling.

References:

[1] Castejón Limas, M.; Ordieres Meré, J.B.; de Cos Juez, F.J.; Martínez de Pisón Ascacibar, F.J. “Control de Calidad. Metodología para el análisis previo de los datos en procesos industriales. Fundamentos teóricos y aplicaciones en R.” Servicio de Publicaciones de la Universidad de La Rioja. 2001. ISBN: 84-95301-48-2

[2] Hou, L., Nassehi, V., Evaluation of stress-effective flow in rubber mixing. Third World Congress of Nonlinear Analysis, 47, 2001, pp. 1809-1820.

[3] El-Nashar, D.E., Turky, G., Effect of mixing conditions and chemical cross-linking agents on the physicomechanical and electrical properties of NR/NBR blends. Polymer-plastics Technology and Engineering, vol. 42, nº 2, 2003, pp. 269-284.

[4] Zhang, A., Wang, L., Zhou, Y., A study on rheological properties of carbon black extended powdered SBR using a torque rheometer. Polymer Testing, 22, 2003, pp. 133-141.

[5] Koolhiran, A., White, J.L., Comparison of intermeshing rotor and traditional rotors of internal mixers in dispersing silica and other fillers. Journal of Applied Polymer Science, vol. 78, nº 8, 2000, pp. 1551-1554.


[6] Freakley, P.K. and Fletcher, J.B., The single-rotor continuous mixing system. Rubber World, vol. 226, nº 4, 2002, pp. 28-31.

[7] Schwartz, G.A., Prediction of rheometric properties of compounds by using artificial neural networks. Rubber Chemistry and Technology, vol. 74, nº 1, 2001, pp. 116-123.

[8] Merikoski, S., Laurikkala, M., Koivisto, H., An adaptive neuro-fuzzy inference system as a soft sensor for viscosity in rubber mixing process. In N. Mastorakis (ed), Advances in neural networks and applications (Greece: World Scientific and Engineering Society Press), 2001, pp. 287-291.

[9] Sombatsompop, N., and Dangtangee, R., Effects of the actual diameters and diameter ratios of barrels and dies on the elastic swell and entrance pressure drop of natural rubber in capillary die flow. Journal of Applied Polymer Science, vol. 86, nº 7, 2002, pp. 1762-1772.

[10] Castejón Limas, M.; Ordieres Meré, J.B.; Martínez de Pisón Ascacibar, F.J.; Vergara González, E.P.; “Outlier detection and data cleaning in multivariate non-normal samples. The PAELLA algorithm.” Data Mining and Knowledge Discovery, Vol. 9, 2004, pp. 171-187.

[11] Castejón Limas, M., “Desarrollo de estrategias basadas en técnicas de inteligencia artificial para la mejora de la calidad en procesos industriales”. PhD Thesis. Universidad de La Rioja, 2004.

[12] Ciampi, A; Lechevallier, Y.; Castejón Limas, M.; González Marcos, A. “Hierarchical Clustering of Sub-Populations with a dissimilarity based on the likelihood ratio statistic: Application to Clustering Massive Data Sets.” Under revision.

[13] Ordieres Meré, J.B, González Marcos, A., González, J.A., Lobato Rubio, V., “Estimation of mechanical properties of steel strip in hot dip galvanizing lines”. Ironmaking & Steelmaking, Vol. 31, nº 1, 2004, pp. 43-50.

[14] Pernía Espinoza, A.V., Ordieres Meré, J.B., Martínez de Pisón, F.J., González Marcos, A., TAO-robust backpropagation learning algorithm, Neural Networks, vol. 18, 2005, pp. 191-204.

[15] Ordieres, J., López, L.M., Bello, A., and Forcada, A., Intelligent methods helping the design of a manufacturing system for die extrusion rubbers. International Journal of Computer Integrated Manufacturing, vol. 16, nº 3, 2003, pp. 173-180.

[16] Fowlkes, E., Mallows, C., “A new method for comparing two hierarchical clusterings.” Journal of the American Statistical Association, Vol. 78, 1983, pp. 553-569.

NOTES:

(*) Joaquín B. Ordieres Meré, Francisco Javier Martínez de Pisón Ascacibar and Fernando Alba Elías work at the Universidad de La Rioja, Departamento de Ingeniería Mecánica, Área de Proyectos de Ingeniería.

(**) The Engineering Data Mining and Numerical Simulation research group joins the efforts of researchers at the Universidad de La Rioja and the Universidad de León.

(***) Antonio Ciampi (McGill University) and Yves Lechevallier (INRIA), in collaboration with their Spanish colleagues, developed the CiTree algorithm used in this paper.

Acknowledgments: We gratefully acknowledge support from the Ministerio de Educación y Ciencia de España, Dirección General de Investigación, by means of the DPI2004-07264-C02-01 research contract; from the “II Plan Riojano de I+D+i”; from the European Union by means of the CEUTIC INTERREG IIIA Spain / France trans-border cooperation project; and from the RFCS program by means of the RFS-CR-03012, RFS-CR-04023 and RFS-CR-04043 projects.

