-------------------------------------
--- Python interface of LIBLINEAR ---
-------------------------------------

Table of Contents
=================

- Introduction
- Installation via PyPI
- Installation via Sources
- Quick Start
- Quick Start with Scipy
- Design Description
- Data Structures
- Utility Functions
- Additional Information

Introduction
============

Python (http://www.python.org/) is a programming language suitable for rapid
development. This tool provides a simple Python interface to LIBLINEAR, a library
for large linear classification (http://www.csie.ntu.edu.tw/~cjlin/liblinear). The
interface is easy to use because its usage mirrors that of LIBLINEAR. The
interface is developed with the built-in Python library "ctypes".

Installation via PyPI
=====================

To install the interface from PyPI, execute the following command:

> pip install -U liblinear-official

Installation via Sources
========================

Alternatively, you may install the interface from sources by
generating the LIBLINEAR shared library.

Depending on your use case, you can choose between local-directory
and system-wide installation.

- Local-directory installation:

    On Unix systems, type

    > make

    This generates a .so file in the LIBLINEAR main directory, and you
    can run the interface in the current python directory.

    For Windows, the shared library liblinear.dll is ready in the
    directory `..\windows' and you can directly run the interface in
    the current python directory. You can copy liblinear.dll to the
    system directory (e.g., `C:\WINDOWS\system32\') to make it
    available system-wide. To regenerate liblinear.dll, please
    follow the instructions for building Windows binaries in the
    LIBLINEAR README.

- System-wide installation:

    Type

    > pip install -e .

    Please note that you must keep the sources after the installation.

    For Windows, to run the above command, Microsoft Visual C++ and
    other tools are needed.

    In addition, do NOT use the following commands, which are known to fail:

    > python setup.py install (fails when run in the python directory)
    > pip install .

Quick Start
===========

"Quick Start with Scipy" is in the next section.

There are two levels of usage. The high-level one uses utility
functions in liblinearutil.py and commonutil.py (shared with LIBSVM
and imported by svmutil.py). The usage is the same as that of the
LIBLINEAR MATLAB interface.

>>> from liblinear.liblinearutil import *
# Read data in LIBSVM format
>>> y, x = svm_read_problem('../heart_scale')
>>> m = train(y[:200], x[:200], '-c 4')
>>> p_label, p_acc, p_val = predict(y[200:], x[200:], m)

# Construct problem in python format
# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
>>> prob  = problem(y, x)
>>> param = parameter('-s 0 -c 4 -B 1')
>>> m = train(prob, param)

# Other utility functions
>>> save_model('heart_scale.model', m)
>>> m = load_model('heart_scale.model')
>>> p_label, p_acc, p_val = predict(y, x, m, '-b 1')
>>> ACC, MSE, SCC = evaluations(y, p_label)

# Getting online help
>>> help(train)

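The dense and sparse inputs in the example above appear to describe the same
two instances: a dense list is read with feature indices starting from 1, and
a dictionary only needs to list the nonzero entries. The following sketch
illustrates that correspondence in plain Python. The helper dense_to_sparse is
hypothetical, written only for this illustration, and is not part of
liblinearutil.py.

```python
def dense_to_sparse(xi):
    """Map a dense feature list to an {index: value} dict (hypothetical helper)."""
    # Feature indices start from 1; zero entries are simply left out.
    return {j + 1: v for j, v in enumerate(xi) if v != 0}


dense_x = [[1, 0, 1], [-1, 0, -1]]
sparse_x = [dense_to_sparse(xi) for xi in dense_x]
print(sparse_x)
```

Either form can be passed as x; the dictionary form is mainly useful when most
feature values are zero.
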
The low-level use directly calls C interfaces imported by liblinear.py. Note that
all arguments and return values are in ctypes format. You need to handle them
carefully.

>>> from liblinear.liblinear import *
>>> prob = problem([1,-1], [{1:1, 3:1}, {1:-1,3:-1}])
>>> param = parameter('-c 4')
>>> m = liblinear.train(prob, param) # m is a ctypes pointer to a model
# Convert a Python-format instance to feature_nodearray, a ctypes structure
>>> x0, max_idx = gen_feature_nodearray({1:1, 3:1})
>>> label = liblinear.predict(m, x0)

Quick Start with Scipy
======================

Make sure you have Scipy installed before proceeding with this section.
If numba (http://numba.pydata.org) is installed, some operations will be much faster.

There are two levels of usage. The high-level one uses utility functions
in liblinearutil.py and the usage is the same as that of the LIBLINEAR MATLAB interface.

>>> import numpy as np
>>> import scipy.sparse
>>> from liblinear.liblinearutil import *
# Read data in LIBSVM format
>>> y, x = svm_read_problem('../heart_scale', return_scipy = True) # y: ndarray, x: csr_matrix
>>> m = train(y[:200], x[:200, :], '-c 4')
>>> p_label, p_acc, p_val = predict(y[200:], x[200:, :], m)

# Construct problem in Scipy format
# Dense data: numpy ndarray
>>> y, x = np.asarray([1,-1]), np.asarray([[1,0,1], [-1,0,-1]])
# Sparse data: scipy csr_matrix((data, (row_ind, col_ind)))
>>> y, x = np.asarray([1,-1]), scipy.sparse.csr_matrix(([1, 1, -1, -1], ([0, 0, 1, 1], [0, 2, 0, 2])))
>>> prob  = problem(y, x)
>>> param = parameter('-s 0 -c 4 -B 1')
>>> m = train(prob, param)

# Apply data scaling in Scipy format
>>> y, x = svm_read_problem('../heart_scale', return_scipy=True)
>>> scale_param = csr_find_scale_param(x, lower=0)
>>> scaled_x = csr_scale(x, scale_param)

# Other utility functions
>>> save_model('heart_scale.model', m)
>>> m = load_model('heart_scale.model')
>>> p_label, p_acc, p_val = predict(y, x, m, '-b 1')
>>> ACC, MSE, SCC = evaluations(y, p_label)

# Getting online help
>>> help(train)

The low-level use directly calls C interfaces imported by liblinear.py. Note that
all arguments and return values are in ctypes format. You need to handle them
carefully.

>>> import numpy as np
>>> import scipy.sparse
>>> from liblinear.liblinear import *
>>> prob = problem(np.asarray([1,-1]), scipy.sparse.csr_matrix(([1, 1, -1, -1], ([0, 0, 1, 1], [0, 2, 0, 2]))))
>>> param = parameter('-c 4')
>>> m = liblinear.train(prob, param) # m is a ctypes pointer to a model
# Convert a tuple of ndarray (index, data) to feature_nodearray, a ctypes structure
# Note that index starts from 0; the following example is converted to 1:1, 3:1 internally
>>> x0, max_idx = gen_feature_nodearray((np.asarray([0,2]), np.asarray([1,1])))
>>> label = liblinear.predict(m, x0)

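As the comment in the example above notes, Scipy indices start from 0 while
the feature indices used internally start from 1. The sketch below shows that
shift in plain Python, applied to the same (data, (row_ind, col_ind)) triplets
used to build the csr_matrix. The helper triplets_to_rows is hypothetical and
is not a function of the interface.

```python
def triplets_to_rows(data, row_ind, col_ind, n_rows):
    """Group (row, col, value) triplets into per-instance {index: value} dicts."""
    rows = [dict() for _ in range(n_rows)]
    for value, r, c in zip(data, row_ind, col_ind):
        # Column c (0-based) corresponds to feature index c + 1 (1-based).
        rows[r][c + 1] = value
    return rows


rows = triplets_to_rows([1, 1, -1, -1], [0, 0, 1, 1], [0, 2, 0, 2], n_rows=2)
print(rows)
```
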
Design Description
==================

Two files, liblinear.py and liblinearutil.py, correspond respectively to
low-level and high-level use of the interface.

In liblinear.py, we adopt the Python built-in library "ctypes", so that
Python can directly access C structures and interface functions defined
in linear.h.

While advanced users can use structures/functions in liblinear.py, to
avoid handling ctypes structures, in liblinearutil.py we provide some easy-to-use
functions. The usage is similar to that of the LIBLINEAR MATLAB interface.

Data Structures
===============

Three data structures derived from linear.h are feature_node, problem, and
parameter. They all contain fields with the same names as in
linear.h. Access these fields carefully because you directly use a C structure
instead of a Python object. The following description introduces additional
fields and methods.

Before using the data structures, execute the following command to load the
LIBLINEAR shared library:

    >>> from liblinear.liblinear import *

- class feature_node:

    Construct a feature_node.

    >>> node = feature_node(idx, val)

    idx: an integer indicating the feature index.

    val: a float indicating the feature value.

    Show the index and the value of a node.

    >>> print(node)

- Function: gen_feature_nodearray(xi [,feature_max=None])

    Generate a feature vector from a Python list/tuple/dictionary, numpy ndarray or tuple of (index, data):

    >>> xi_ctype, max_idx = gen_feature_nodearray({1:1, 3:1, 5:-2})

    xi_ctype: the returned feature_nodearray (a ctypes structure)

    max_idx: the maximal feature index of xi

    feature_max: if feature_max is assigned, features with indices larger than
                 feature_max are removed.

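To make the effect of feature_max concrete, here is a minimal plain-Python
sketch of the filtering described above. It does not build a ctypes structure;
filter_features is a hypothetical stand-in, and reporting max_idx after the
filtering (with 0 for an empty result) is an assumption of this sketch rather
than documented behavior.

```python
def filter_features(xi, feature_max=None):
    """Return sorted (index, value) pairs and the largest kept index."""
    indices = sorted(xi)
    if feature_max is not None:
        # Features with indices larger than feature_max are removed.
        indices = [j for j in indices if j <= feature_max]
    pairs = [(j, xi[j]) for j in indices]
    # Assumption: max_idx is taken over the kept features, 0 if none remain.
    max_idx = indices[-1] if indices else 0
    return pairs, max_idx


all_pairs, all_max = filter_features({1: 1, 3: 1, 5: -2})
kept_pairs, kept_max = filter_features({1: 1, 3: 1, 5: -2}, feature_max=3)
print(all_pairs, all_max)
print(kept_pairs, kept_max)
```
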
- class problem:

    Construct a problem instance.

    >>> prob = problem(y, x [,bias=-1])

    y: a Python list/tuple/ndarray of l labels (type must be int/double).

    x: 1. a list/tuple of l training instances. The feature vector of
          each training instance is a list/tuple or dictionary.

       2. an l * n numpy ndarray or scipy spmatrix (n: number of features).

    bias: if bias >= 0, instance x becomes [x; bias]; if bias < 0, no bias
          term is added (default -1).

    You can also modify the bias value by

    >>> prob.set_bias(1)

    Note that if your x contains sparse data (i.e., dictionaries), the internal
    ctypes data format is still sparse.

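The notation [x; bias] above means that every instance gets one extra feature
whose value is the bias. The sketch below shows this on sparse dictionaries. It
assumes the extra feature takes index n + 1, where n is the number of original
features; add_bias is a hypothetical helper and not how the problem class is
implemented.

```python
def add_bias(x, n, bias=-1):
    """Append a constant bias feature to each sparse instance (illustration only)."""
    if bias < 0:
        # A negative bias means no bias term is added.
        return [dict(xi) for xi in x]
    # Assumption: the bias feature is stored at index n + 1.
    return [{**xi, n + 1: bias} for xi in x]


x = [{1: 1, 3: 1}, {1: -1, 3: -1}]
with_bias = add_bias(x, n=3, bias=1)
without_bias = add_bias(x, n=3)
print(with_bias)
```
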
- class parameter:

    Construct a parameter instance.

    >>> param = parameter('training_options')

    If 'training_options' is empty, LIBLINEAR default values are applied.

    Set param to LIBLINEAR default values.

    >>> param.set_to_default_values()

    Parse a string of options.

    >>> param.parse_options('training_options')

    Show values of parameters.

    >>> print(param)

- class model:

    There are two ways to obtain an instance of model:

    >>> model_ = train(y, x)
    >>> model_ = load_model('model_file_name')

    Note that the structure returned by the interface functions
    liblinear.train and liblinear.load_model is a ctypes pointer to a
    model, which is different from the model object returned
    by train and load_model in liblinearutil.py. We provide a
    function toPyModel for the conversion:

    >>> model_ptr = liblinear.train(prob, param)
    >>> model_ = toPyModel(model_ptr)

    If you obtain a model in a way other than the above approaches,
    handle it carefully to avoid memory leaks or segmentation faults.

    Some interface functions to access LIBLINEAR models are wrapped as
    members of the class model:

    >>> nr_feature = model_.get_nr_feature()
    >>> nr_class = model_.get_nr_class()
    >>> class_labels = model_.get_labels()
    >>> is_prob_model = model_.is_probability_model()
    >>> is_regression_model = model_.is_regression_model()

    The decision function is W*x + b, where
        W is an nr_class-by-nr_feature matrix, and
        b is a vector of size nr_class.
    To access W_kj (i.e., the coefficient for the k-th class and the j-th feature)
    and b_k (i.e., the bias for the k-th class), use the following functions.

    >>> W_kj = model_.get_decfun_coef(feat_idx=j, label_idx=k)
    >>> b_k = model_.get_decfun_bias(label_idx=k)

    We also provide a function to extract w_k (i.e., the k-th row of W) and
    b_k directly as follows.

    >>> [w_k, b_k] = model_.get_decfun(label_idx=k)

    Note that w_k is a Python list of length nr_feature, which means that
        w_k[0] = W_k1.
    For regression models, W is just a vector of length nr_feature. Either
    set label_idx=0 or omit the label_idx parameter to access the coefficients.

    >>> W_j = model_.get_decfun_coef(feat_idx=j)
    >>> b = model_.get_decfun_bias()
    >>> [W, b] = model_.get_decfun()

    For one-class SVM models, label_idx is ignored and b=-rho is
    returned from get_decfun(). That is, the decision function is
    w*x+b = w*x-rho.

    >>> rho = model_.get_decfun_rho()
    >>> [W, b] = model_.get_decfun()

    Note that in get_decfun_coef, get_decfun_bias, and get_decfun, feat_idx
    starts from 1, while label_idx starts from 0. If label_idx is not in the
    valid range (0 to nr_class-1), then a NaN will be returned; and if feat_idx
    is not in the valid range (1 to nr_feature), then a zero value will be
    returned. For regression models, label_idx is ignored.

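The index conventions above (feat_idx from 1, label_idx from 0) are easy to mix
up, so here is a small plain-Python sketch that evaluates W*x + b for one
sparse instance. W and b are made-up numbers standing in for the rows returned
by get_decfun; decision_values is a hypothetical helper and does not call into
LIBLINEAR.

```python
def decision_values(W, b, x):
    """Evaluate W*x + b for a sparse instance x given as {feat_idx: value}."""
    values = []
    for w_k, b_k in zip(W, b):           # one row per label_idx k = 0, 1, ...
        total = b_k
        for feat_idx, value in x.items():
            # feat_idx starts from 1, so it maps to position feat_idx - 1 in w_k.
            if 1 <= feat_idx <= len(w_k):
                total += w_k[feat_idx - 1] * value
        values.append(total)
    return values


# Made-up coefficients: nr_class = 2 rows, nr_feature = 3 columns.
W = [[0.5, 0.0, -1.0],
     [1.0, 2.0, 0.0]]
b = [0.25, -0.5]
x = {1: 2.0, 3: 1.0}
values = decision_values(W, b, x)
print(values)
```
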
Utility Functions
=================

To use utility functions, type

    >>> from liblinear.liblinearutil import *

The above command loads
    train()            : train a linear model
    predict()          : predict testing data
    svm_read_problem() : read the data from a LIBSVM-format file
    load_model()       : load a LIBLINEAR model
    save_model()       : save model to a file
    evaluations()      : evaluate prediction results

- Function: train

    There are three ways to call train():

    >>> model = train(y, x [, 'training_options'])
    >>> model = train(prob [, 'training_options'])
    >>> model = train(prob, param)

    y: a list/tuple/ndarray of l training labels (type must be int/double).

    x: 1. a list/tuple of l training instances. The feature vector of
          each training instance is a list/tuple or dictionary.

       2. an l * n numpy ndarray or scipy spmatrix (n: number of features).

    training_options: a string in the same form as that for LIBLINEAR command
                      mode.

    prob: a problem instance generated by calling
          problem(y, x).

    param: a parameter instance generated by calling
           parameter('training_options')

    model: the returned model instance. See linear.h for details of this
           structure. If '-v' is specified, cross validation is
           conducted and the returned model is just a scalar: cross-validation
           accuracy for classification and mean-squared error for regression.

           If the '-C' option is specified, best parameters are found
           by cross validation. The parameter selection utility is supported
           only by -s 0, -s 2 (for finding C) and -s 11 (for finding C, p).
           The returned structure is a triple with the best C, the best p,
           and the corresponding cross-validation accuracy or mean squared
           error. The returned best p for -s 0 and -s 2 is set to -1 because
           the p parameter is not used by classification models.

    To train on the same data many times with different
    parameters, the second and the third ways should be faster.

    Examples:

    >>> y, x = svm_read_problem('../heart_scale')
    >>> prob = problem(y, x)
    >>> param = parameter('-s 3 -c 5 -q')
    >>> m = train(y, x, '-c 5')
    >>> m = train(prob, '-w1 5 -c 5')
    >>> m = train(prob, param)
    >>> CV_ACC = train(y, x, '-v 3')
    >>> best_C, best_p, best_rate = train(y, x, '-C -s 0') # best_p is only for -s 11
    >>> m = train(y, x, '-c {0} -s 0'.format(best_C)) # use the same solver: -s 0

- Function: predict

    To predict testing data with a model, use

    >>> p_labels, p_acc, p_vals = predict(y, x, model [,'predicting_options'])

    y: a list/tuple/ndarray of l true labels (type must be int/double).
       It is used for calculating the accuracy. Use [] if true labels are
       unavailable.

    x: 1. a list/tuple of l testing instances. The feature vector of
          each testing instance is a list/tuple or dictionary.

       2. an l * n numpy ndarray or scipy spmatrix (n: number of features).

    predicting_options: a string of predicting options in the same format as
                        that of LIBLINEAR.

    model: a model instance.

    p_labels: a list of predicted labels.

    p_acc: a tuple including accuracy (for classification), mean
           squared error, and squared correlation coefficient (for
           regression).

    p_vals: a list of decision values or probability estimates (if '-b 1'
            is specified). If k is the number of classes, for decision values,
            each element includes results of predicting k binary-class
            SVMs. If k = 2 and solver is not MCSVM_CS, only one decision value
            is returned. For probabilities, each element contains k values
            indicating the probability that the testing instance is in each class.
            Note that the order of classes here is the same as the 'model.label'
            field in the model structure.

    Example:

    >>> m = train(y, x, '-c 5')
    >>> p_labels, p_acc, p_vals = predict(y, x, m)

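The p_vals description above says that, with '-b 1', each element holds k
probabilities in the same order as the model's label field. The sketch below
shows how such a row can be matched back to class labels, assuming the label of
interest is the class with the largest probability. The labels and
probabilities are made-up values, and most_probable_label is a hypothetical
helper.

```python
def most_probable_label(labels, probs):
    """Pick the label whose probability estimate is largest."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return labels[best]


labels = [1, -1]                        # stands in for model_.get_labels()
p_vals = [[0.8, 0.2], [0.3, 0.7]]       # stands in for the p_vals of two instances
picked = [most_probable_label(labels, probs) for probs in p_vals]
print(picked)
```
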
- Functions: svm_read_problem/load_model/save_model

    See the usage by examples:

    >>> y, x = svm_read_problem('data.txt')
    >>> m = load_model('model_file')
    >>> save_model('model_file', m)

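For readers unfamiliar with the LIBSVM data format read by svm_read_problem,
each line holds a label followed by index:value pairs. The sketch below parses
two made-up lines into the (y, x) form used throughout this README. It is a
simplified illustration under that assumption, not the implementation of
svm_read_problem; parse_libsvm_line is a hypothetical name.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(':')
        features[int(index)] = float(value)
    return label, features


sample_lines = ["+1 1:0.5 3:-1", "-1 2:1 3:0.25"]   # made-up data
y, x = [], []
for line in sample_lines:
    label, features = parse_libsvm_line(line)
    y.append(label)
    x.append(features)
print(y, x)
```
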
- Function: evaluations

    Calculate some evaluations using the true values (ty) and the predicted
    values (pv):

    >>> (ACC, MSE, SCC) = evaluations(ty, pv, useScipy)

    ty: a list/tuple/ndarray of true values.

    pv: a list/tuple/ndarray of predicted values.

    useScipy: convert ty, pv to ndarray, and use scipy functions to do the evaluation.

    ACC: accuracy.

    MSE: mean squared error.

    SCC: squared correlation coefficient.

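The three measures can be written out in a few lines of plain Python. The
sketch below is an illustration of the definitions, not the code in
commonutil.py. It assumes that accuracy is reported as a percentage and that
SCC is NaN when either input has zero variance; evaluations_sketch is a
hypothetical name.

```python
def evaluations_sketch(ty, pv):
    """Return (ACC, MSE, SCC) for true values ty and predicted values pv."""
    if len(ty) != len(pv):
        raise ValueError("len(ty) must be equal to len(pv)")
    l = len(ty)
    correct = sum(1 for y, v in zip(ty, pv) if y == v)
    acc = 100.0 * correct / l                      # assumption: percentage
    mse = sum((v - y) ** 2 for y, v in zip(ty, pv)) / l
    sum_y, sum_v = sum(ty), sum(pv)
    sum_yy = sum(y * y for y in ty)
    sum_vv = sum(v * v for v in pv)
    sum_vy = sum(v * y for y, v in zip(ty, pv))
    numerator = (l * sum_vy - sum_v * sum_y) ** 2
    denominator = (l * sum_vv - sum_v ** 2) * (l * sum_yy - sum_y ** 2)
    scc = numerator / denominator if denominator != 0 else float('nan')
    return acc, mse, scc


acc, mse, scc = evaluations_sketch([1, -1, 1, -1], [1, -1, -1, -1])
print(acc, mse, scc)
```
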
- Functions: csr_find_scale_param/csr_scale

    Scale data in csr format.

    >>> param = csr_find_scale_param(x [, lower=l, upper=u])
    >>> x = csr_scale(x, param)

    x: a csr_matrix of data.

    l: x scaling lower limit; default -1.

    u: x scaling upper limit; default 1.

    The scaling process is: x * diag(coef) + ones(m, 1) * offset',
    where m is the number of rows (instances) in x.

    param: a dictionary of scaling parameters, where param['coef'] = coef and param['offset'] = offset.

    coef: a scipy array of scaling coefficients.

    offset: a scipy array of scaling offsets.

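The scaling formula above multiplies each feature column by its coefficient and
then adds its offset. The sketch below works through it on a small dense list
of rows. It assumes coef and offset come from min-max scaling of each column to
[lower, upper], with constant columns left unscaled. Both function names are
hypothetical, and the real functions operate on a csr_matrix.

```python
def find_scale_param(rows, lower=-1.0, upper=1.0):
    """Derive per-column coef and offset (assumed min-max scaling)."""
    coef, offset = [], []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        if hi == lo:
            # Constant column: leave it unscaled in this sketch.
            coef.append(0.0)
            offset.append(0.0)
        else:
            c = (upper - lower) / (hi - lo)
            coef.append(c)
            offset.append(lower - c * lo)
    return {'coef': coef, 'offset': offset}


def scale(rows, param):
    """Apply value * coef + offset to every column of every row."""
    return [[v * c + o for v, c, o in zip(row, param['coef'], param['offset'])]
            for row in rows]


rows = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
param = find_scale_param(rows, lower=0.0, upper=1.0)
scaled = scale(rows, param)
print(param)
print(scaled)
```
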
Additional Information
======================

This interface was originally written by Hsiang-Fu Yu from the Department of
Computer Science, National Taiwan University. If you find this tool useful, please
cite LIBLINEAR as follows:

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.
LIBLINEAR: A Library for Large Linear Classification, Journal of
Machine Learning Research 9(2008), 1871-1874. Software available at
http://www.csie.ntu.edu.tw/~cjlin/liblinear

For any questions, please contact Chih-Jen Lin <cjlin@csie.ntu.edu.tw>,
or check the FAQ page:

http://www.csie.ntu.edu.tw/~cjlin/liblinear/faq.html
 506
 507 http://www.csie.ntu.edu.tw/~cjlin/liblinear/faq.html