机器之心 | Train, test, and use models without writing a single line of code: this project does it for you


Reported by 机器之心
机器之心 Editorial Team
igel is a trending tool on GitHub. Built on top of scikit-learn, it supports all of sklearn's machine learning features, such as regression, classification, and clustering. Users can work with machine learning models without writing a single line of code; all it takes is a yaml or json file describing what you want to do.

Train, test, and use a model without writing a single line of code: can it really be that easy?
Recently, software engineer Nidhal Baccouri open-sourced just such a machine learning tool on GitHub: igel. It quickly made GitHub's trending list and has already collected 1.5k stars.
Project address: https://github.com/nidhaloff/igel
The project aims to give everyone, technical or not, a convenient way to use machine learning.
The author describes the motivation for creating igel as follows: "Sometimes I need a tool to quickly prototype a machine learning model, whether for a proof of concept or to create a quick draft model. I often found myself stuck writing boilerplate code or fretting over how to start. So I decided to create igel."
The basic idea is to group all configuration in a human-readable yaml or json file, including the model definition, data preprocessing methods, and so on, and then let igel automate everything else. The user describes what they need in the yaml or json file; igel then builds a model from that configuration, trains it, and returns the results along with metadata.
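To make this concrete, a minimal configuration might look like the sketch below. The field names follow the schema shown in igel's README, but the algorithm choice and the target column name are illustrative assumptions, not values taken from this article:

```yaml
# minimal igel configuration sketch (illustrative values)
model:
  type: classification      # regression, classification, or clustering
  algorithm: RandomForest   # an sklearn-backed algorithm supported by igel
target:
  - sick                    # column(s) to predict; hypothetical column name
```

Training would then be launched from the command line with something like `igel fit --data_path 'train.csv' --yaml_path 'igel.yaml'` (command syntax as documented in the igel README; check `igel --help` for the version you have installed).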
All configurations currently supported by igel are shown below:
# dataset operations
dataset:
    type: csv  # [str] -> type of your dataset
    read_data_options:  # options you want to supply for reading your data (See the detailed overview about this in the next section)
        sep:  # [str] -> Delimiter to use.
        delimiter:  # [str] -> Alias for sep.
        header:  # [int, list of int] -> Row number(s) to use as the column names, and the start of the data.
        names:  # [list] -> List of column names to use
        index_col:  # [int, str, list of int, list of str, False] -> Column(s) to use as the row labels of the DataFrame
        usecols:  # [list, callable] -> Return a subset of the columns
        squeeze:  # [bool] -> If the parsed data only contains one column then return a Series.
        prefix:  # [str] -> Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
        mangle_dupe_cols:  # [bool] -> Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than 'X'...'X'. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
        dtype:  # [Type name, dict mapping column name to type] -> Data type for data or columns
        engine:  # [str] -> Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
        converters:  # [dict] -> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
        true_values:  # [list] -> Values to consider as True.
        false_values:  # [list] -> Values to consider as False.
        skipinitialspace:  # [bool] -> Skip spaces after delimiter.
        skiprows:  # [list-like] -> Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
        skipfooter:  # [int] -> Number of lines at bottom of file to skip
        nrows:  # [int] -> Number of rows of file to read. Useful for reading pieces of large files.
        na_values:  # [scalar, str, list, dict] -> Additional strings to recognize as NA/NaN.
        keep_default_na:  # [bool] -> Whether or not to include the default NaN values when parsing the data.
        na_filter:  # [bool] -> Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
        verbose:  # [bool] -> Indicate number of NA values placed in non-numeric columns.
        skip_blank_lines:  # [bool] -> If True, skip over blank lines rather than interpreting as NaN values.
        parse_dates:  # [bool, list of int, list of str, list of lists, dict] -> try parsing the dates
        infer_datetime_format:  # [bool] -> If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them.
        keep_date_col:  # [bool] -> If True and parse_dates specifies combining multiple columns then keep the original columns.
        dayfirst:  # [bool] -> DD/MM format dates, international and European format.
        cache_dates:  # [bool] -> If True, use a cache of unique, converted dates to apply the datetime conversion.
        thousands:  # [str] -> the thousands separator
        decimal:  # [str] -> Character to recognize as decimal point (e.g. use ',' for European data).
        lineterminator:  # [str] -> Character to break file into lines.
        escapechar:  # [str] -> One-character string used to escape other characters.
        comment:  # [str] -> Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character.
        encoding:  # [str] -> Encoding to use for UTF when reading/writing (ex. 'utf-8').
        dialect:  # [str, csv.Dialect] -> If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting
        delim_whitespace:  # [bool] -> Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep
        low_memory:  # [bool] -> Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference.
        memory_map:  # [bool] -> If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
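The read_data_options above mirror the parameters of pandas' read_csv (the descriptions are the pandas ones), which igel relies on to load csv datasets. A small standalone pandas sketch, independent of igel, shows how a few of them, sep, decimal, and comment, interact:

```python
import io

import pandas as pd

# A small csv with a ';' delimiter, a European-style decimal comma,
# and a comment line that should be skipped during parsing
raw = """name;score
# this line should be ignored
alice;3,5
bob;4,0
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",      # corresponds to read_data_options.sep
    decimal=",",  # corresponds to read_data_options.decimal
    comment="#",  # corresponds to read_data_options.comment
)

print(df)
print(df["score"].dtype)  # scores parse as floats thanks to decimal=","
```

With these three options, "3,5" and "4,0" land in the DataFrame as the floats 3.5 and 4.0, and the comment line never reaches the parser's output.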

