##根据数据集推测分布函数及其参数
给一个数据集,如何才能知道它是属于哪一个分布?之前“排队论”中讲过一个根据数字特征来判断的。除了方差,期望,还引进了一个概念叫变异系数(coefficient of variation)。定义的话就是标准差和期望的比值。一般不同的分布,变异系数不同且固定的,根据变异系数就能初步估计是什么分布了。
此外R语言中MASS包里有个fitdistr函数,这个函数的作用就是将手头的数据去拟合罗列出来的分布函数,然后给出拟合得最好时候的参数,我们可以根据返回值中的对数似然值的比较,来推测数据集究竟符合哪种分布。当然这个方法和变异系数一点关系都没有^_^
R语言源码
# Read in a table (in csv format) from standard input:
Table = data.matrix(read.csv( file = 'stdin.csv', header=TRUE ))
# All parameter values in this assignment will be integers !
Distribution = c( "normal", "t", "chi-squared", "lognormal", "exponential", "gamma", "logistic")
Distribution_can_have_negative_values = c( TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE )
# display a histogram for each column Dataset
for (j in 1:ncol(Table)) {
maximum_likelihood = -100000000
maximum_likelihood_dist_name = "none"
Dataset = Table[,j] # j-th dataset = the j-th column of the table
Dataset_is_nonnegative = !any( Dataset < 0 )
hist( Dataset, col="gray90", xlim=c(0,15), breaks=50, probability=TRUE )
for (i in 1:length(Distribution)) {
dist_name = Distribution[i]
if (Distribution_can_have_negative_values[i] || Dataset_is_nonnegative) {
# don't try to fit a nonnegative distribution to data that is negative
if (dist_name == "chi-squared") {
# fitdistr requires special handling of chi-squared
fit = suppressWarnings( fitdistr( Dataset, dist_name,
list(df=round(mean(Dataset))), method="BFGS" ) )
} else {
fit = suppressWarnings( fitdistr( Dataset, dist_name ) )
}
log_likelihood = fit$loglik
if(maximum_likelihood<log_likelihood){
maximum_likelihood=log_likelihood
maximum_likelihood_dist_name=dist_name
maximum_likelihood_fitted_parameters = fit$estimate
}
# The optimal distribution is the one with maximum-likelihood
# (and: maximum-likelihood == maximum-log-likelihood).
}
}
parameter_value_string = paste(round(maximum_likelihood_fitted_parameters), collapse="_")
# print integer parameters
cat(sprintf("%s_%s\n", maximum_likelihood_dist_name, parameter_value_string))
}
这里的话就是比较log-likelihood的值的大小,越大则拟合的越好(非绝对值越大)。