Java Date-Time API Series 39: Recognizing Time Expressions in Chinese Sentences (time NLP: given a sentence, extract the time it mentions), How It Works

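NLP (Natural Language Processing) is a subfield of artificial intelligence (AI). Natural language is a product of human intelligence, and natural language processing is one of the hardest problems in AI (per Baidu Baike).

Chinese is particularly hard to process. This post looks at recognizing time expressions in Chinese sentences (time NLP: given a sentence, extract the time it mentions) and walks through two simple approaches.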

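1. Single-word recognition

This is the simpler case. It covers words such as 今天 (today), 明天 (tomorrow), 下周 (next week), 下月 (next month), 明年 (next year), 昨天 (yesterday), 上周 (last week), 上月 (last month), and 去年 (last year). The principle: when the input matches 明天, add one day to today's date.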

import java.util.HashMap;
import java.util.Map;

/**
 * Common time keywords.
 *
 * @author xkzhangsan
 */
public enum CommonTimeEnum {

    TODAY("today", "今天"),

    TOMORROW("tomorrow", "明天"),
    NEXTWEEK("nextWeek", "下周"),
    NEXTMONTH("nextMonth", "下月"),
    NEXTYEAR("nextYear", "明年"),

    YESTERDAY("yesterday", "昨天"),
    LASTWEEK("lastWeek", "上周"),
    LASTMONTH("lastMonth", "上月"),
    LASTYEAR("lastYear", "去年"),
    ;

    private final String code;

    private final String name;

    public String getCode() {
        return code;
    }

    public String getName() {
        return name;
    }

    CommonTimeEnum(String code, String name) {
        this.code = code;
        this.name = name;
    }

    public static Map<String, String> convertToMap(){
        Map<String, String> commonTimeMap = new HashMap<String, String>();
        for (CommonTimeEnum commonTimeEnum : CommonTimeEnum.values()) {
            commonTimeMap.put(commonTimeEnum.getCode(), commonTimeEnum.getCode());
            commonTimeMap.put(commonTimeEnum.getName(), commonTimeEnum.getCode());
        }
        return commonTimeMap;
    }

    public static CommonTimeEnum getCommonTimeEnumByCode(String code){
        for (CommonTimeEnum commonTimeEnum : CommonTimeEnum.values()) {
            if(commonTimeEnum.getCode().equals(code)){
                return commonTimeEnum;
            }
        }
        return null;
    }
}





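    /**
     * Parses a natural-language time word such as 今天, 明天, 下周, 下月, 明年, 昨天, 上周, 上月 or 去年.
     * @param text the natural-language time word to parse
     * @param naturalLanguageMap custom map of time words; keys are user-defined, and each value must be a
     *                            code from com.xkzhangsan.time.enums.CommonTimeEnum. May be empty, in which
     *                            case com.xkzhangsan.time.enums.CommonTimeEnum is used for parsing.
     * @return the resolved Date, or null if the text is not recognized
     */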
    public static Date parseNaturalLanguageToDate(String text, Map<String, String> naturalLanguageMap){
        if(StringUtil.isEmpty(text)){
            return null;
        }
        text = text.trim();

        boolean isCommonTimeMap = false;
        if(CollectionUtil.isEmpty(naturalLanguageMap)){
            naturalLanguageMap = CommonTimeEnum.convertToMap();
            isCommonTimeMap = true;
        }
        if(! naturalLanguageMap.containsKey(text) || StringUtil.isEmpty(naturalLanguageMap.get(text))){
            return null;
        }

        String targetMethod = null;
        if(isCommonTimeMap){
            targetMethod = naturalLanguageMap.get(text);
        }else{
            String code = naturalLanguageMap.get(text);
            Map<String, String> commonTimeMap = CommonTimeEnum.convertToMap();
            if(commonTimeMap.containsKey(code)){
                targetMethod = commonTimeMap.get(code);
            }
        }
        if(targetMethod == null){
            return null;
        }

        // Look up the enum constant and dispatch to the matching calculation
        CommonTimeEnum targetCommonTime = CommonTimeEnum.getCommonTimeEnumByCode(targetMethod);
        if(targetCommonTime == null){
            return null;
        }
        
        switch (targetCommonTime){
            case TODAY :
                return DateTimeCalculatorUtil.today();
            case TOMORROW:
                return DateTimeCalculatorUtil.tomorrow();
            case NEXTWEEK:
                return DateTimeCalculatorUtil.nextWeek();
            case NEXTMONTH:
                return DateTimeCalculatorUtil.nextMonth();
            case NEXTYEAR:
                return DateTimeCalculatorUtil.nextYear();
            case YESTERDAY:
                return DateTimeCalculatorUtil.yesterday();
            case LASTWEEK:
                return DateTimeCalculatorUtil.lastWeek();
            case LASTMONTH:
                return DateTimeCalculatorUtil.lastMonth();
            case LASTYEAR:
                return DateTimeCalculatorUtil.lastYear();
            default:
                return null;
        }
    }


//DateTimeCalculatorUtil



    /**
     * Tomorrow: today plus one day.
     * @return Date
     */
    public static Date tomorrow(){
        return plusDays(today(), 1);
    }

    /**
     * Today (the current instant).
     * @return Date
     */
    public static Date today(){
        return new Date();
    }
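
For comparison, the same keyword-lookup idea can be sketched in a self-contained way with the Java 8 `java.time` API. This is only an illustration under stated assumptions: the class `SimpleKeywordTime`, its `resolve` method, and the explicit `base` parameter are hypothetical names chosen here and are not part of xk-time. Passing the base date in, rather than reading the clock, makes the result reproducible.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not the xk-time implementation.
public class SimpleKeywordTime {

    // Each keyword maps to a function that shifts the base date.
    private static final Map<String, UnaryOperator<LocalDate>> RULES = new HashMap<>();

    static {
        RULES.put("今天", d -> d);
        RULES.put("明天", d -> d.plusDays(1));
        RULES.put("下周", d -> d.plusWeeks(1));
        RULES.put("下月", d -> d.plusMonths(1));
        RULES.put("明年", d -> d.plusYears(1));
        RULES.put("昨天", d -> d.minusDays(1));
        RULES.put("上周", d -> d.minusWeeks(1));
        RULES.put("上月", d -> d.minusMonths(1));
        RULES.put("去年", d -> d.minusYears(1));
    }

    // Returns the shifted date, or null when the word is not a known keyword.
    public static LocalDate resolve(String text, LocalDate base) {
        if (text == null) {
            return null;
        }
        UnaryOperator<LocalDate> rule = RULES.get(text.trim());
        return rule == null ? null : rule.apply(base);
    }

    public static void main(String[] args) {
        LocalDate base = LocalDate.of(2021, 6, 10);
        System.out.println(resolve("明天", base)); // 2021-06-11
        System.out.println(resolve("上月", base)); // 2021-05-10
    }
}
```

Anything that is not an exact keyword, such as 下周一, returns null, which is the limitation the second approach addresses.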

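2. Recognizing time expressions in Chinese sentences

This is time recognition in real-world sentences. For example, given "Hi，all.下周一下午三点开会" ("Hi all, meeting next Monday at 3 p.m.") with today being 2021-06-10, the result is 2021-06-14 15:00:00.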
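
The date arithmetic behind this example can be checked with a small `java.time` sketch. It only reproduces the calculation for this one phrase and does not parse any text; the class and method names are hypothetical, and it assumes that weeks start on Monday, so 下周一 means the Monday of the week after the base date.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAdjusters;

// Hypothetical sketch of the arithmetic for "下周一下午三点".
public class NextMondayExample {

    // Monday of the week after the base date (weeks assumed to start on Monday), at 15:00.
    public static LocalDateTime nextMondayAt3pm(LocalDate base) {
        LocalDate monday = base.plusWeeks(1)
                .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return LocalDateTime.of(monday, LocalTime.of(15, 0));
    }

    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        LocalDate base = LocalDate.of(2021, 6, 10); // a Thursday
        System.out.println(nextMondayAt3pm(base).format(fmt)); // 2021-06-14 15:00:00
    }
}
```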

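2.1 Principle and diagram

The principle is similar to the first approach: recognize time words and infer the result from a base time. It is simply more powerful.

It breaks down into three steps:

(1) Load the regex file

(2) Extract all time words from the Chinese sentence

(3) Resolve each time word from (2) in turn against the base time

The code is fairly long, so it is not reproduced here; see the project code on GitHub for details. The original post shows a simple flowchart of the process at this point.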
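
As a rough illustration of the three steps, here is a deliberately tiny sketch. It is not the Time-NLP or xk-time code: the class `TinyTimeExtractor`, the `extract` method, and the hard-coded pattern are assumptions made for this example. The real projects load a much larger regex from a resource file and handle far more expressions.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the load / extract / resolve pipeline.
public class TinyTimeExtractor {

    // Step 1: stand-in for "load the regex file". Compiled once and reused.
    // It only covers 今天/明天/后天, an optional 上午/下午, and a whole hour.
    private static final Pattern TIME_PATTERN =
            Pattern.compile("(今天|明天|后天)(上午|下午)?(\\d{1,2})点");

    public static List<LocalDateTime> extract(String sentence, LocalDateTime base) {
        List<LocalDateTime> results = new ArrayList<>();
        Matcher m = TIME_PATTERN.matcher(sentence);
        // Step 2: find every time expression in the sentence.
        while (m.find()) {
            // Step 3: resolve each expression against the base time.
            int dayOffset = "明天".equals(m.group(1)) ? 1
                    : "后天".equals(m.group(1)) ? 2 : 0;
            int hour = Integer.parseInt(m.group(3));
            if ("下午".equals(m.group(2)) && hour < 12) {
                hour += 12; // 下午3点 is 15:00
            }
            if (hour > 23) {
                continue; // not a valid hour, skip this match
            }
            results.add(base.toLocalDate().plusDays(dayOffset).atTime(hour, 0));
        }
        return results;
    }

    public static void main(String[] args) {
        LocalDateTime base = LocalDateTime.of(2021, 6, 10, 10, 0);
        // Prints [2021-06-11T09:00, 2021-06-12T15:00]
        System.out.println(extract("明天上午9点开会，后天下午3点交报告", base));
    }
}
```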

2.2 Related source code and notes

2.2.1 Time-NLP

github: https://github.com/shinyke/Time-NLP

author: shinyke

Adapted from the time-analysis feature of FudanNLP, with many refinements to details and functionality.

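Support for vague times of day, such as 早上 (morning), 晚上 (evening), 中午 (noon), and 傍晚 (dusk).
Future-time preference. For example, entering "周一早上开会" (meeting Monday morning) on a Friday resolves to the morning of the following Monday; entering "9点送牛奶给隔壁的汉子" (deliver milk to the guy next door at 9) at 17:00 resolves to 9 a.m. the next day.
Recognition of multiple times, with context carried between them. For example, in "下月1号下午3点至5点到图书馆还书" (return books to the library from 3 to 5 p.m. on the 1st of next month), the start time resolves to 3 p.m. on the 1st of next month, and the end time inherits the preceding context and resolves to 5 p.m. on the 1st of next month.
Customizable base time: if the base time is set to "2016-05-20-09-00-00-00", all analysis is relative to that time.
Fixes for a variety of bugs.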

In short, this is a tool that takes a sentence and recognizes the time mentioned in it.
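
The future-time preference described above can be sketched as a simple rule: if the resolved time falls before the base time, move it forward to the next occurrence. The code below is an illustrative assumption about how such a rule might look, not Time-NLP's actual logic; `FuturePreference` and both method names are hypothetical.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.temporal.TemporalAdjusters;

// Hypothetical sketch of a "prefer the future" rule.
public class FuturePreference {

    // A bare clock time such as "9点": use today unless that moment has already passed.
    public static LocalDateTime resolveClockTime(LocalTime time, LocalDateTime base) {
        LocalDateTime candidate = base.toLocalDate().atTime(time);
        return candidate.isBefore(base) ? candidate.plusDays(1) : candidate;
    }

    // A bare weekday such as "周一": the next such weekday on or after the base date.
    public static LocalDate resolveWeekday(DayOfWeek day, LocalDate base) {
        return base.with(TemporalAdjusters.nextOrSame(day));
    }

    public static void main(String[] args) {
        // At 17:00, "9点" resolves to 09:00 the next day.
        System.out.println(resolveClockTime(LocalTime.of(9, 0),
                LocalDateTime.of(2021, 6, 10, 17, 0))); // 2021-06-11T09:00
        // On Friday 2016-05-20, "周一" resolves to the following Monday.
        System.out.println(resolveWeekday(DayOfWeek.MONDAY,
                LocalDate.of(2016, 5, 20))); // 2016-05-23
    }
}
```

With this rule, a time later on the same day stays on the same day: at 10:00, 15:00 resolves to 15:00 today rather than tomorrow.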

2.2.2 xk-time TimeNLPUtil

https://github.com/xkzhangsan/xk-time TimeNLPUtil

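Many improvements were made on top of Time-NLP:

(1) Encapsulated fields and renamed identifiers to follow camelCase conventions.
(2) Changed loading of the regex resource file to a singleton.
(3) Reorganized the code into several separate classes by function.
(4) Rewrote the code using the Java 8 date API.
(5) Added comments and cleaned up the code.
(6) Fixed an issue from the original project: parsing of the standard formats yyyy-MM-dd, yyyy-MM-dd HH:mm:ss, and yyyy-MM-dd HH:mm.
(7) Fixed an issue from the original project: parsing of expressions such as 1小时后 (in 1 hour), 1个半小时后 (in an hour and a half), and 1小时50分钟 (1 hour 50 minutes); seconds are now supported too, e.g. 50秒后 (in 50 seconds) and 10分钟30秒后 (in 10 minutes 30 seconds).
(8) Fixed an issue from the original project: when the current time was 10 a.m., 下午三点 (3 p.m.) was resolved as 3 p.m. tomorrow.
(9) Fixed an issue from the original project: exceptions when parsing decimal numbers.
(10) Performance: the regexes in use are precompiled and cached so that later calls reuse them directly.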
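
Item (10) is a common technique and can be sketched in a few lines. The class below is a hypothetical illustration, not xk-time's actual cache: it keeps compiled `Pattern` objects in a `ConcurrentHashMap` keyed by the regex string, so each regex is compiled at most once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch of a precompiled-regex cache.
public class PatternCache {

    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles the regex on first use and returns the cached instance afterwards.
    public static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = get("\\d+秒后");
        Pattern second = get("\\d+秒后");
        System.out.println(first == second);                    // true: same cached instance
        System.out.println(first.matcher("50秒后提醒我").find()); // true
    }
}
```

`Pattern` instances are immutable and safe to share between threads, which is what makes this kind of cache safe; each caller still creates its own `Matcher`.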

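3. Limitations of these approaches

The first approach can only recognize single words.

The second approach can only recognize the words covered by the regex file. It is more capable than the first, but it cannot handle new or uncommon time words, such as 礼拜一, a synonym of 星期一 (Monday). Supporting new words means modifying the rules again and again, which is where a machine-learning approach does better.

For commonly used time words, the second approach already achieves a high recognition rate.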
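
To make the synonym limitation concrete, one rule-based workaround is to normalize known synonyms before matching. The sketch below is a hypothetical illustration, not something either project is confirmed to do, and it shows the maintenance cost: every new synonym needs another hand-written entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rewrite known synonyms into the form the regex already understands.
public class SynonymNormalizer {

    // Insertion order matters: longer keys must be replaced before their prefixes.
    private static final Map<String, String> SYNONYMS = new LinkedHashMap<>();

    static {
        SYNONYMS.put("礼拜天", "星期日");
        SYNONYMS.put("礼拜", "星期");
    }

    public static String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> e : SYNONYMS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("下礼拜一下午三点开会")); // 下星期一下午三点开会
    }
}
```

Plain string replacement like this can also rewrite text that is not a time word at all, which is another reason hand-maintained rules only go so far.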

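4. Why I built this feature

The first implementation: a reader needed to recognize Chinese time words, so I wrote it.

The second implementation: another reader needed to recognize Chinese time expressions inside sentences. He recommended the Time-NLP project to me, saying that it was very good but no longer maintained and had a few small problems, and that he hoped I could build an implementation based on it. I studied the original code, then rewrote and optimized it in my own project and fixed some of the issues.

Thanks to shinyke. The project is excellent, and I learned a lot about regex-based parsing from it.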

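Source code: https://github.com/xkzhangsan/xk-time

Looking for the fulcrum to move the earth (the solution to the problem); the lever (Java and other programming languages) is already at hand. xkzhangsan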

Original post: https://www.cnblogs.com/xkzhangsanx/p/14873321.html