Next: Functions and Variables for descriptive statistics, Previous: Introduction to descriptive, Nach oben: Package descriptive [Inhalt][Index]
Builds a sample from a table of absolute frequencies. The input table can be a matrix or a list of lists, all of them of equal size. The number of columns or the length of the lists must be greater than 1. The last element of each row or list is interpreted as the absolute frequency. The output is always a sample in matrix form.
Examples:
Univariate frequency table.
(%i1) load ("descriptive")$
(%i2) sam1: build_sample([[6,1], [j,2], [2,1]]);
                       [ 6 ]
                       [   ]
                       [ j ]
(%o2)                  [   ]
                       [ j ]
                       [   ]
                       [ 2 ]
(%i3) mean(sam1);
                      2 j + 8
(%o3)                [-------]
                         4
(%i4) barsplot(sam1) $
Multivariate frequency table.
(%i1) load ("descriptive")$
(%i2) sam2: build_sample([[6,3,1], [5,6,2], [u,2,1],[6,8,2]]) ;
                           [ 6  3 ]
                           [      ]
                           [ 5  6 ]
                           [      ]
                           [ 5  6 ]
(%o2)                      [      ]
                           [ u  2 ]
                           [      ]
                           [ 6  8 ]
                           [      ]
                           [ 6  8 ]
(%i3) cov(sam2);
       [   2                 2                            ]
       [  u  + 158   (u + 28)     2 u + 174   11 (u + 28) ]
       [  -------- - ---------    --------- - ----------- ]
(%o3)  [     6          36            6           12      ]
       [                                                  ]
       [ 2 u + 174   11 (u + 28)            21            ]
       [ --------- - -----------            --            ]
       [     6           12                 4             ]
(%i4) barsplot(sam2, grouping=stacked) $
The argument of continuous_freq must be a list of numbers.
Divides the range in intervals and counts how many values
are inside them.  The second argument is optional and either
equals the number of classes we want, 10 by default, or
equals a list containing the class limits and the number of
classes we want, or a list containing only the limits.
Argument list must be a list of (2 or 3) real numbers.
If sample values are all equal, this function returns only
one class of amplitude 2.
Examples:
Optional argument indicates the number of classes we want.
The first list in the output contains the interval limits, and
the second the corresponding counts: there are 16 digits inside 
the interval [0, 1.8], 24 digits in (1.8, 3.6], and so on.
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, 5);
(%o3) [[0, 1.8, 3.6, 5.4, 7.2, 9.0], [16, 24, 18, 17, 25]]
Optional argument indicates we want 7 classes with limits
-2 and 12:
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, [-2,12,7]);
(%o3) [[- 2, 0, 2, 4, 6, 8, 10, 12], [8, 20, 22, 17, 20, 13, 0]]
Optional argument indicates we want the default number of classes with limits
-2 and 12:
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, [-2,12]);
                3  4  11  18     32  39  46  53
(%o3)  [[- 2, - -, -, --, --, 5, --, --, --, --, 12], 
                5  5  5   5      5   5   5   5
               [0, 8, 20, 12, 18, 9, 8, 25, 0, 0]]
Counts absolute frequencies in discrete samples, both numeric and categorical. Its unique argument is a list,
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) discrete_freq (s1);
(%o3) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 
                             [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
The first list gives the sample values and the second their absolute 
frequencies.  Commands ? col and ? transpose should help you to 
understand the last input.
This is a sort of variant of the Maxima submatrix function.  The first 
argument is the data matrix, the second is a predicate function and optional 
additional arguments are the numbers of the columns to be taken.  Its behaviour 
is better understood with examples.
These are multivariate records in which the wind speed in the first 
meteorological station were greater than 18.  See that in the lambda expression 
the i-th component is refered to as v[i].
(%i1) load ("descriptive")$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) subsample (s2, lambda([v], v[1] > 18));
              [ 19.38  15.37  15.12  23.09  25.25 ]
              [                                   ]
              [ 18.29  18.66  19.08  26.08  27.63 ]
(%o3)         [                                   ]
              [ 20.25  21.46  19.95  27.71  23.38 ]
              [                                   ]
              [ 18.79  18.96  14.46  26.38  21.84 ]
In the following example, we request only the first, second and fifth components of those records with wind speeds greater or equal than 16 in station number 1 and less than 25 knots in station number 4. The sample contains only data from stations 1, 2 and 5. In this case, the predicate function is defined as an ordinary Maxima function.
(%i1) load ("descriptive")$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) g(x):= x[1] >= 16 and x[4] < 25$
(%i4) subsample (s2, g, 1, 2, 5);
                     [ 19.38  15.37  25.25 ]
                     [                     ]
                     [ 17.33  14.67  19.58 ]
(%o4)                [                     ]
                     [ 16.92  13.21  21.21 ]
                     [                     ]
                     [ 17.25  18.46  23.87 ]
Here is an example with the categorical variables of biomed.data.
We want the records corresponding to those patients in group B
who are older than 38 years.
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) h(u):= u[1] = B and u[2] > 38 $
(%i4) subsample (s3, h);
                [ B  39  28.0  102.3  17.1  146 ]
                [                               ]
                [ B  39  21.0  92.4   10.3  197 ]
                [                               ]
                [ B  39  23.0  111.5  10.0  133 ]
                [                               ]
                [ B  39  26.0  92.6   12.3  196 ]
(%o4)           [                               ]
                [ B  39  25.0  98.7   10.0  174 ]
                [                               ]
                [ B  39  21.0  93.2   5.9   181 ]
                [                               ]
                [ B  39  18.0  95.0   11.3  66  ]
                [                               ]
                [ B  39  39.0  88.5   7.6   168 ]
Probably, the statistical analysis will involve only the blood measures,
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) subsample (s3, lambda([v], v[1] = B and v[2] > 38),
                 3, 4, 5, 6);
                   [ 28.0  102.3  17.1  146 ]
                   [                        ]
                   [ 21.0  92.4   10.3  197 ]
                   [                        ]
                   [ 23.0  111.5  10.0  133 ]
                   [                        ]
                   [ 26.0  92.6   12.3  196 ]
(%o3)              [                        ]
                   [ 25.0  98.7   10.0  174 ]
                   [                        ]
                   [ 21.0  93.2   5.9   181 ]
                   [                        ]
                   [ 18.0  95.0   11.3  66  ]
                   [                        ]
                   [ 39.0  88.5   7.6   168 ]
This is the multivariate mean of s3,
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) mean (s3);
       65 B + 35 A  317          6 NA + 8144.999999999999
(%o3) [-----------, ---, 87.178, ------------------------, 
           100      10                     100
                                                    3 NA + 19587
                                            18.123, ------------]
                                                        100
Here, the first component is meaningless, since A and B are 
categorical, the second component is the mean age of individuals in rational 
form, and the fourth and last values exhibit some strange behaviour.  This is 
because symbol NA is used here to indicate non available data, 
and the two means are nonsense.  A possible solution would be to take out from 
the matrix those rows with NA symbols, although this deserves some 
loss of information.
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) g(v):= v[4] # NA and v[6] # NA $
(%i4) mean (subsample (s3, g, 3, 4, 5, 6));
(%o4) [79.4923076923077, 86.2032967032967, 16.93186813186813, 
                                                            2514
                                                            ----]
                                                             13
Next: Functions and Variables for descriptive statistics, Previous: Introduction to descriptive, Nach oben: Package descriptive [Inhalt][Index]