I always try to work on both theoretical statistics (to develop useful
methods) and applied statistics (to apply theoretical methods to real
problems and to gain motivation for enriching statistical theory). I
have organized my research below according to my areas of research
interest. The numbers in brackets refer to the numerical labels of my
research reports and publications in the list attached to this summary.
1. Goodness-of-Fit Statistics Based on Kernel Density Estimates.
Bickel and Rosenblatt (1973, Ann. Statist.) introduced a goodness-of-fit
statistic based on a kernel density estimate. They derived the asymptotic
distribution of the proposed statistic under both the null hypothesis and
local function alternatives; these local function alternatives generalize
Pitman alternatives. The problem of choosing the "best" possible kernel
was not studied by Bickel and Rosenblatt. B.K. Ghosh and I studied this
problem for a while; we finally solved it, and part of the results was
published in the Annals of Statistics [4]. The optimal choice of the
kernel is made in terms of maximizing the local asymptotic power of the
test. The result is somewhat surprising and unexpected: the standard
quadratic kernel (the well-known Epanechnikov kernel in density
estimation) does not maximize the local asymptotic power. In fact the
simplest kernel, the uniform kernel, is the one that maximizes the local
asymptotic power of the Bickel-Rosenblatt test. This implies that the
moving histogram should be used in the kernel-based goodness-of-fit
statistic. The proof takes the Fourier transform of the target functional
and then applies Parseval's identity to the transformed functional. A
standard variational method, using the Gateaux differential and a
convexity argument, is then used to obtain the optimal solution of the
target functional. Beran (1977, Ann. Statist.) proposed a goodness-of-fit
test based on the Hellinger distance, and the power of his test is also a
decreasing function of the target functional in our study; the kernel
obtained in [4] should therefore be a good choice for Beran's test as
well. Using the Monte Carlo method, Ghosh and I ([5]) also compared the
power of the Bickel-Rosenblatt test (for sample size n = 40) with those
of several well-known tests based on the empirical distribution. In [10]
and [12] I studied nonparametric likelihood ratio and adaptive tests and
investigated their sampling distributions by the Monte Carlo method. I
found that the proposed likelihood ratio statistic is closely related to
the Bickel-Rosenblatt statistic. From the simulation study I realized
that the first-order approximation (used in Bickel and Rosenblatt (1973))
can be improved: the normal approximation applied to the log-transformed
statistic works better than the approximation applied to the statistic
without the log transformation ([12], [24]).
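To make the object concrete, the following Python sketch computes a
simplified, unweighted Bickel-Rosenblatt-type statistic: the integrated
squared deviation of a uniform-kernel ("moving histogram") density
estimate from the hypothesized density. The bandwidth, grid, and uniform
null in the example are illustrative assumptions, not the choices studied
in [4] or [5].

    import numpy as np

    def uniform_kernel(u):
        # Uniform kernel on [-1, 1]; the kernel shown in [4] to maximize
        # the local asymptotic power of the Bickel-Rosenblatt test.
        return 0.5 * (np.abs(u) <= 1.0)

    def bickel_rosenblatt_stat(x, f0, h, grid):
        # Kernel density estimate with the uniform kernel (a "moving histogram").
        n = len(x)
        fhat = uniform_kernel((grid[:, None] - x[None, :]) / h).sum(axis=1) / (n * h)
        # Integrated squared deviation from the hypothesized density f0,
        # approximated by a Riemann sum on the grid.
        return np.sum((fhat - f0(grid)) ** 2) * (grid[1] - grid[0])

    # Example: testing uniformity on [0, 1] with a simulated sample of size 40.
    rng = np.random.default_rng(0)
    x = rng.uniform(size=40)
    grid = np.linspace(0.0, 1.0, 501)
    print(bickel_rosenblatt_stat(x, lambda s: np.ones_like(s), h=0.15, grid=grid))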
2. Estimation of Discontinuous Density.
Rosenblatt and Parzen proposed the well-known kernel estimator for an
unknown continuous density function. Schuster (1985, Comm. in Statist.)
suggested a modification, the reflected version of the kernel estimator,
to rectify certain drawbacks of kernel estimators when the unknown
density has discontinuity points. The optimal choice (in terms of
minimizing the integrated mean squared error (IMSE)) of the kernel for
the Rosenblatt-Parzen estimator is known to be the quadratic kernel
obtained by Epanechnikov (1969). But the kernel which minimizes the IMSE
for the folded kernel estimator was not known before 1991; it is a rather
difficult problem. Ghosh and I worked on this problem for a period of
time. In fact, we did not know of the reflected kernel estimator proposed
by Schuster when we submitted the work for publication; we approached the
problem entirely by "folding" the ordinary kernel estimator with respect
to the points of discontinuity. We found the optimal kernel for this
problem, and it is included in [3] and [28], along with some other
related problems on rates of the IMSE and asymptotic relative efficiency.
Some of the results in [3] are treated in greater detail in [28]. Two
different definitions of the IMSE were used in the study, and they lead
to some interesting results. Part of the results also supplements earlier
related results of van Eeden (1985, Ann. Inst. Statist. Math.) and Cline
and Hart (1990, Statistics). The new optimal kernel has not been used in
density estimation before and appears somewhat unusual, but the uniform
kernel (described in the previous section in connection with the
Bickel-Rosenblatt goodness-of-fit test) and the Epanechnikov kernel (the
optimal kernel for estimating a continuous density) can, in fact, be
thought of as first- and second-order approximations to the new kernel.
As an application, the folded kernel estimator could be used in the
Bickel-Rosenblatt goodness-of-fit statistic, since the uniform density,
the density under the null hypothesis, is discontinuous at the two
endpoints. In fact, it improves the normal approximation used by Bickel
and Rosenblatt (1973). Some of the results are included in [24] and in a
dissertation of one of my Ph.D. students at Lehigh.
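As a small illustration of the reflection ("folding") idea, the Python
sketch below estimates a density with a single known discontinuity point
at the left endpoint by reflecting the sample about that point. The
Epanechnikov kernel and the bandwidth are placeholder choices for
illustration only; they are not the new optimal kernel obtained in [3]
and [28].

    import numpy as np

    def epanechnikov(u):
        # Epanechnikov kernel, optimal for estimating a smooth density.
        return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

    def reflected_kde(x, grid, h, boundary=0.0):
        # Reflect the sample about the discontinuity point and use both copies,
        # so that the estimate does not leak mass across the boundary.
        n = len(x)
        reflected = 2.0 * boundary - x
        contrib = (epanechnikov((grid[:, None] - x[None, :]) / h)
                   + epanechnikov((grid[:, None] - reflected[None, :]) / h))
        fhat = contrib.sum(axis=1) / (n * h)
        return np.where(grid >= boundary, fhat, 0.0)

    # Example: exponential data, whose density is discontinuous at 0.
    rng = np.random.default_rng(1)
    x = rng.exponential(size=200)
    grid = np.linspace(0.0, 5.0, 251)
    print(reflected_kde(x, grid, h=0.3)[:5])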
3. Semiparametric Modeling.
A semiparametric model is a statistical model consisting of both
parametric and nonparametric components. Many practical problems in
statistics can be modeled in terms of a semiparametric model with a
suitable choice of both components. We can also set up a semiparametric
model so that it includes a proposed parametric model while leaving us
the freedom not to be too specific about some of the assumptions in that
parametric model; in a sense we are hoping to get a more "robust" model.
But one consequence of "enlarging" the parameter space is that it becomes
rather difficult to evaluate the efficiencies of statistical procedures.
The problem of evaluating the efficiency of an estimator in a purely
parametric model or a purely nonparametric model is well developed, but
the efficiency problem for a mixed parametric-nonparametric model is
relatively new. Under the supervision of Professor W.J. Hall and
Professor Jon A. Wellner, I worked on semiparametric problems for my
Ph.D. dissertation. Begun, Hall, Wellner and I ([16]) developed some
results on evaluating the efficiency of estimators in general
semiparametric models. The work was motivated by the pioneering work of
Stein (1956, Proc. Third Berkeley Symp.), Bickel (1982, Ann. Statist.),
LeCam (1972, Proc. Sixth Berkeley Symp.) and others. The work leads to
information-type bounds for the estimation of the parameters of
semiparametric models, and it also leads to methods of constructing
effective estimators based on "effective scores". Some methods (motivated
by the work of Bickel (1982) and the general score function method of
C.R. Rao) were proposed in [33] and [34] and have been applied to some
specific problems in [14] and [31]. The problem of nonparametric
estimation of the cumulative distribution function under random
truncation was treated as a problem in a semiparametric model in [7].
Tsai and I established a convolution-type representation theorem and a
local asymptotic minimax theorem to show the asymptotic efficiency of the
nonparametric estimator proposed by Lynden-Bell (1971). The results are
similar to those for the complete data case due to Beran (1977, Ann.
Statist.) and for the censored data case due to Wellner (1982, Ann.
Statist.). Both likelihood and functional approaches (the latter was also
used in [14]) are considered in the study. An attempt to generalize the
semiparametric theory to the non-i.i.d. case was made for autoregressive
models: using the "projected score", I showed ([13]) that it is possible
to construct adaptive estimators of the autoregressive coefficients. In a
different direction, Hall and I ([8]) developed bounds on the exponential
rates of consistent estimates in semiparametric models in the sense of
large deviations. The bounds can be treated as lower bounds for the
asymptotic effective standard deviation of such estimates. A directional
method was used in the study; the work is an extension of Bahadur's
results for parametric models. Semiparametric modeling is very useful in
statistics since many practical problems in statistics are naturally of
semiparametric type. Most of the statistical problems in semiparametric
models are by now solved and known. The book by Bickel, Klaassen, Ritov
and Wellner (1993) includes the general theory of semiparametric models
and detailed studies of many interesting models. The Fifth Lukacs
Symposium will be held in Bowling Green, Ohio on March 24 and 25 of 1995;
the theme will be Statistical Inference in Semiparametric Models. I was
invited to give a talk at the Symposium.
4. Tests for Lack-of-Fit in Linear Models.
In linear models, we often want to verify the proposed model. When
replication is not available, the classical pure-error test no longer
works. Such situations arise routinely with experimenters who are simply
unaware of the need for replication; they may also arise from the very
nature of an experiment in which replication is impossible. Many test
procedures have been suggested under such circumstances (Chow (1960),
Shillington (1979), Utts (1982), and Neill and Johnson (1985)). The
drawback is that the notion of "cluster" or "near replicates" is
sometimes highly subjective and, more seriously, the exact procedures
sacrifice certain information while the approximate procedures leave no
clue to the true p-value and power of the tests. Ghosh and I ([19], [26])
propose a simple class of exact tests based on subdivisions of the
observed points. We provide justifications for such a class from
different viewpoints. In particular, we (a) show that Utts' test and,
when replications are available, the classical test can be treated as
special cases of our test, (b) demonstrate by formal calculations that
our test can be more powerful than others under suitable choices of the
subdivision, and (c) argue that our test should effectively dispense with
the need for the existing tests based on "near replicates". Power
comparisons and extensive numerical studies are also included in our
work. The paper is almost finished and is ready to be submitted for
publication. Some related generalizations will be studied in the near
future.
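To convey the flavor of such procedures, the sketch below implements
only the classical pure-error/lack-of-fit F decomposition with a
user-supplied grouping of (near) replicates; the grouping rule, and hence
the resulting test, is an illustrative assumption and not the exact test
proposed in [19] and [26].

    import numpy as np

    def lack_of_fit_F(y, X, groups):
        # Fit the proposed linear model by ordinary least squares.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sse = resid @ resid                  # residual sum of squares
        n, p = X.shape

        # "Pure error": within-group variation, each group collecting
        # (near) replicates of the same design point.
        sspe, df_pe = 0.0, 0
        for g in np.unique(groups):
            yg = y[groups == g]
            sspe += ((yg - yg.mean()) ** 2).sum()
            df_pe += len(yg) - 1

        sslof = sse - sspe                   # lack-of-fit sum of squares
        df_lof = (n - p) - df_pe
        F = (sslof / df_lof) / (sspe / df_pe)
        return F, df_lof, df_pe

With genuine replicates, F is referred to an F distribution with
(df_lof, df_pe) degrees of freedom; with subjectively chosen near
replicates the reference distribution is only approximate, which is
precisely the drawback that motivates the exact tests in [19] and [26].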
5. Fixed-Width Confidence Interval for Bernoulli Probability.
Almost every day we see statistical reports in newspapers, magazines,
journals and many books. Very often such reports are developed on the
basis of statistical sampling. Yes-no type questions are typically asked
in a survey, and inferences can then be made from these Bernoulli
variables. To reduce the cost of sampling and to save time, we often want
to conduct the experiment by a sequential method. Ghosh and I ([20])
developed a sequential rule for fixed-width confidence intervals for the
Bernoulli probability. We follow the pioneering work of Chow and Robbins
(1965, Ann. Math. Statist.), making particular use of the special
structure of the binomial distribution in deriving the confidence
interval. Compared with some known methods, we found that the proposed
fixed-width confidence interval seems to have more reliable control of
the coverage probability and a smaller expected sample size. Some
interesting links (in terms of first-order approximations) to Chow and
Robbins' results are also developed. We intend (a) to investigate
second-order asymptotic properties of the proposed rule and (b) to use
the bootstrap method to obtain useful approximations of the related
distributions and moments. Part of the results was presented at the IMS
Sequential Workshop at UNC on June 18-19 of 1994. A draft of our work is
available.
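For orientation, a generic Chow-Robbins-type stopping rule for Bernoulli
data can be sketched as follows: sample one observation at a time and
stop as soon as the estimated half-width of the normal-approximation
interval drops below the prescribed value d. The minimum sample size and
the plain normal approximation are illustrative assumptions; the rule in
[20], which exploits the binomial structure further, is not reproduced
here.

    import numpy as np
    from scipy.stats import norm

    def sequential_fixed_width_ci(sample_next, d, alpha=0.05, n_min=10,
                                  n_max=100000):
        # Stop at the first n >= n_min with z * sqrt(phat*(1-phat)/n) <= d,
        # so that (phat - d, phat + d) has approximate coverage 1 - alpha.
        z = norm.ppf(1.0 - alpha / 2.0)
        total, n = 0, 0
        while n < n_max:
            total += sample_next()
            n += 1
            phat = total / n
            if n >= n_min and z * np.sqrt(phat * (1.0 - phat) / n) <= d:
                break
        phat = total / n
        return n, (phat - d, phat + d)

    # Example with simulated Bernoulli(0.3) observations.
    rng = np.random.default_rng(2)
    print(sequential_fixed_width_ci(lambda: rng.binomial(1, 0.3), d=0.05))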
6. Approximate Entropy and Applications.
For a while I was involved in analyzing medical data from hospitals in
the Lehigh Valley area. One of the problems was to use the given medical
data to detect the onset of an apparent life-threatening event; the
analysis is aimed at preventing sudden infant death syndrome. The data
are typically collected in the form of ECG, EEG and respiratory
recordings. Pincus and I ([2]) propose to use entropy as a chaos-related
measure of pattern and to estimate it by a quantity called approximate
entropy. We use approximate entropy to quantify the regularity
(complexity) in data and to provide an information-theoretic quantity for
time series data. One of the difficult problems is to obtain the sampling
distribution of the proposed approximate entropy, especially in the
two-sample setup. Some results were developed in the paper, and we in
particular introduced the idea of randomized approximate entropy in the
spirit of sample-reuse methods. The randomized approximate entropy yields
an empirical significance probability that two processes differ, on the
basis of one data set from each process. As an application, it also
provides a test for the hypothesis that an underlying time series is
generated by i.i.d. variables. It is always exciting to study real cases,
but it is also true that most real cases are not easy to handle. A
project at St. Luke's Hospital was proposed several months ago to use the
approximate entropy in a real case study, and we hope to see if it really
works.
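For reference, the standard approximate entropy ApEn(m, r) can be
computed as in the sketch below. The embedding dimension m = 2 and
tolerance r equal to 0.2 times the sample standard deviation are
conventional illustrative choices, and the randomized (sample-reuse)
version introduced in [2] is not shown.

    import numpy as np

    def approximate_entropy(x, m=2, r=0.2):
        # ApEn(m, r) = Phi_m(r) - Phi_{m+1}(r), with r scaled by the SD of x.
        x = np.asarray(x, dtype=float)
        N = len(x)
        tol = r * x.std()

        def phi(m):
            # Embed the series into overlapping vectors of length m.
            emb = np.array([x[i:i + m] for i in range(N - m + 1)])
            # C_i: proportion of vectors within Chebyshev distance tol of vector i.
            dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
            C = (dist <= tol).mean(axis=1)
            return np.mean(np.log(C))

        return phi(m) - phi(m + 1)

    # A regular signal should have lower ApEn than an irregular one.
    t = np.arange(300)
    print(approximate_entropy(np.sin(0.3 * t)),
          approximate_entropy(np.random.default_rng(3).normal(size=300)))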
7. Bounds for Measures of Association with Given Marginals & Related
Distributional Problems.
W.C. Shih of Merck Sharp Research Labs asked me whether it is possible
to obtain better bounds for the correlation coefficient when the marginal
distributions are specified. He raised the question because of needs
arising in medically related statistics. Shih and I ([1], [27]) worked
out the problem and provided an answer to the question. In [27], we
derived the lower and upper bounds using an argument based on the
Neyman-Pearson Lemma, which is different from the method used in
Cambanis, Simons and Stout (1976, Zeit. Wahr.). In fact, we came up with
useful representations of the lower and upper bounds in terms of quantile
functions. Numerical examples are included in [1]. While visiting Rutgers
University during the Fall semester of 1993, I had a great chance to
discuss this problem with Professor Kemperman. Kemperman gave me some
useful comments, and I was able to rederive the bounds using a completely
different argument. Hoeffding's formula for the correlation and Frechet's
distributions were used in the new derivation, and the work is included
in [25] and [21]. Part of it was presented at the International Joint
Statistical Conference in the Winter of 1993 in Taiwan. Since in practice
the theoretical marginal distributions are unknown, I proposed to use the
empirical versions to estimate the theoretical bounds. This empirical
substitution leads to an interesting link to the Hardy-Littlewood
inequalities and a probabilistic version of the Hardy-Littlewood
inequalities. Related distributional problems are treated in [22]. Paper
[21] is pretty much done and will be submitted for publication. Some
interesting distributional results have already been obtained in a draft
of [22]. Some rather difficult and deep questions are involved in the
distributional studies of the empirical lower and upper bounds.
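The quantile-function representation of the bounds lends itself to a
direct numerical illustration: the largest (smallest) attainable
correlation corresponds to pairing the two quantile functions
comonotonically (countermonotonically). The Python sketch below evaluates
the bounds by a midpoint Riemann sum on the unit interval; the particular
marginals in the example are illustrative, and the sketch reflects only
the representation, not the derivations in [27] or [25].

    import numpy as np
    from scipy.stats import expon, lognorm

    def correlation_bounds(Fq, Gq, mx, my, sx, sy, n_grid=100000):
        # Upper bound uses E[F^{-1}(U) G^{-1}(U)]; lower bound uses
        # E[F^{-1}(U) G^{-1}(1 - U)], with U uniform on (0, 1).
        u = (np.arange(n_grid) + 0.5) / n_grid
        upper = (np.mean(Fq(u) * Gq(u)) - mx * my) / (sx * sy)
        lower = (np.mean(Fq(u) * Gq(1.0 - u)) - mx * my) / (sx * sy)
        return lower, upper

    # Example: exponential(1) and lognormal (sigma = 1) marginals.
    F, G = expon(), lognorm(s=1.0)
    print(correlation_bounds(F.ppf, G.ppf, F.mean(), G.mean(), F.std(), G.std()))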
8. Others. My other miscellaneous works include permutation tests for
comparing two regressions ([6]), martingales induced by Markov chains
([9]), and asymptotically efficient estimation of interval failure rates
([15]). I also have some ongoing work in different directions which I
will not mention here.