• Refactoring a legacy project that has been running for over 10 years


      In the second half of last year I took over maintenance of an outsourced project. The project dates back to around 2005 and uses a traditional three-tier architecture. Its main function is a web crawler that scrapes product data from various overseas e-commerce sites and stores it in the client's database. My refactoring of the project recently passed acceptance, so I'd like to share my approach.

    Phase 1: Getting familiar with the project framework and understanding how it runs and is maintained

    Tools used: Microsoft Visual Studio 2005, SQL Server 2005, Axosoft OnTime Scrum, SVN

    Development process: the client provides requirement documents, then coding, unit testing, UAT deployment, UAT testing, deployment at the client, and QA testing

    Project layering:

    During this phase I found several problems:

    1. Many requirement documents had been lost
    2. The code logic no longer matched the requirement documents
    3. Large amounts of duplicated code
    4. The regular expressions used to match data were all stored in the database, making them hard to read and inconvenient to modify
    5. Many regexes had become overly complex as they grew
    <li\s*[^>]+list-view-*(?<css>[^"]*)"[^>]*>\s*  <h[^>]*>\s*<a[\s\S]*?href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s*(?<name>[\d\D]{0,500}?)</a>\s*  (?:<a[^>]*>)?\s*<i\s*class="ch-icon-heart"></i>\s*</a>\s*</h\d+>\s*  (?:<p\s*[^>]+list-view-item-subtitle">(?<subTitle>[\d\D]{0,5000}?)</p>)?\s*  (?:<ul[^>]*>(?<subTitle2>[\d\D]{0,5000}?)</ul>)?\s*   (?:<a\s*href=[^>]+>)?\s*(?:<im[\d\D]{1,200}?(?:-|_)MLA(?<photo>[^.]+)[^>]+>)?\s*(?:</a>)?\s*(?:<img[\d\D]{1,200}?images/(?<photo2>[^.]+)[^>]+>)?\s*(?:</a>)?\s*  [\d\D]*?  <\s*[^>]+price-info">\s*  (?:<[^>]+price-info-cost">(?:[\d\D]*?)<strong\s*[^>]+price">\s*(?<currency>[^\d&]*)(?:&nbsp;)?(?<price>\d+(?:\.\d{3})*(?:\.\d+)?)\s*(?:<sup>(?<priceDecimales>\d*)</sup>\s*)?  (?:\s*<span[^>]*>[^<]*</span>)?  \s*</strong>\s*(?:</div>\s*)?  (?:<strong\s*[^>]+price-info-auction">(?<type>[^<]*)</strong>)?\s*  (?:<span\s*[^>]+price-info-auction-endTime">[^<\d]*?(?:(?<day>\d+)d)?\s*(?:(?<hour>\d+)h)?\s*(?:(?<minute>\d+)m)?\s*(?:(?<second>\d+)s)?\s*</span>\s*)?(?:</span>)?\s*  (?:<span\s*[^>]+price-info-installments"><span\s*class=installmentsQuantity>(?<numberOfPayment>\d+)</span>[^<]+  <span\s*[^>]+price">\s*[^<]*?(?<pricePayment>\d+(?:\.\d{3})*(?:\.\d+)?)\s*<sup>(?<pricePaymentDecimales>[\d\D]{0,10}?)</sup>\s*</span>\s*  </span>\s*)?|<[^>]*[^>]+price-info-cost-agreed">[^>]*</[^>]*>\s*)(?:</p>)?\s*  [\d\D]*?  (?:<ul\s*class="medal-list">\s*<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>\s*</ul>\s*)?  <ul\s*[^>]+extra-info">\s*(?:<li\s*class="ch-ico\s*search[^>]+">[^<]*</li>\s*)?  (?:<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>)?\s*(?:<!--\s*-->)?\s*  (?:<li\s*[^>]+[^>]*(?:condition|inmobiliaria|concesionaria)">\s*(?:<strong>)?(?<condition>[^\d<]*?)(?:</strong>)?\s*</li>\s*)?\s*  (?:<li\s*[^>]+"extra-info-sold">(?<bids>\d+)*[^<]*</li>\s*)?  (?:  <li\s*[^>]+[^>]*location">(?<location>[^<]*?)\s*</li>\s*(?:<li\s*class="free-shipping">[^<]*</li>\s*)?  |<li>(?<location>[^<]*?)\s*</li>\s*)?(?:<li\s*class="free-shipping">[^<]*</li>\s*)?
(?:</ul>|<li[^>]*>\s*Tel\.?:\s*(?<phone>[^<]+)</li>)  |  <div\s*[^>]+item-[^>]*>\s*<h[^>]*>\s*<a\s*href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s*(?<name>[\d\D]{0,500}?)</a>\s*</h3>\s*  (?:[\d\D]*?)<li\s*[^>]+costs"><span\s*[^>]+price">\s*(?<currency>[^\d&]*)(?:&nbsp;)?(?<price>\d+(?:\.\d{3})*(?:\.\d+)?)\s*</span></li>  (?:[\d\D]*?)(?:</ul>|<li[^>]*>\s*Tel\.?:\s*(?:&nbsp;)*(?<phone>[^<]+)</li>)
    A complex regular expression

    Phase 2: Completing all the requirement documents and extracting every regex into files

    We added a final step to the development process: updating the documentation. Whenever testing or a maintenance task finishes, the requirement documents must be brought up to date. We also extracted all the regexes out of the database into files, which cut down the SQL maintenance workload and reduced the chance of a newcomer making mistakes while editing SQL. Through the combined effort of QA and myself, more than 200 requirement documents were reorganized, giving future maintenance work a reliable reference.
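
    The gain from moving the patterns into files can be sketched roughly as follows: a versioned pattern file is parsed once at startup and regexes are looked up by name. The `PatternFile` class and the XML shape below are illustrative assumptions mirroring the demo pattern file shown later in this post, not the project's actual `PatternContainerHelper`:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Illustrative sketch only: load named regexes from an XML pattern file
// instead of querying them out of the database.
public static class PatternFile
{
    public static Dictionary<string, Regex> Load(string xml)
    {
        var patterns = new Dictionary<string, Regex>();
        foreach (var node in XDocument.Parse(xml).Descendants("Pattern"))
        {
            string name = (string)node.Attribute("Name");
            string expression = node.Value.Trim(); // the CDATA body holds the raw regex
            patterns[name] = new Regex(expression,
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }
        return patterns;
    }
}
```

    Once the patterns live in files they can be diffed, code-reviewed, and versioned in SVN like any other source artifact, which is what actually reduces the risk of hand-editing SQL.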

    Phase 3: Reworking the data access layer

    To remove the traditional hand-written data access layer code, I first planned to adopt Entity Framework. After talking with the client, however, it turned out they were more familiar with NHibernate, so we wrapped NHibernate in a Repository instead. This repository layer has little to do with domain-driven design; it is really just an oversized DbHelper.

            public void SaveInfoByCity(InfoByCity line, string config)
            {
                SQLQuery query = new SQLQuery();
    
                query.CommandType = CommandType.StoredProcedure;
                query.CommandText = "HangZhou_InsertInfoByCity";
    
                SqlParameter[] parameters = new SqlParameter[7];
                parameters[0] = new SqlParameter("@City", line.City);
                parameters[1] = new SqlParameter("@AvailableUnits", line.AvailableUnits);
                parameters[2] = new SqlParameter("@AvailableSqm", line.AvailableSqm);
                parameters[3] = new SqlParameter("@ResAvailUnits", line.ResAvailUnits);
                parameters[4] = new SqlParameter("@ResAvailSqm", line.ResAvailSqm);
                parameters[5] = new SqlParameter("@ReservedUnits", line.ReservedUnits);
                parameters[6] = new SqlParameter("@ReservedSqm", line.ReservedSqm);
    
                SqlHelper.ExecuteNonQuery(ConnectionStringManager.GetConnectionString(CALLER_ASSEMBLY_NAME, config),
                    query.CommandType, query.CommandText, parameters);
    
            }
    The Dao code being removed

            /// <summary>
            /// SaveRestaurant
            /// </summary>
            /// <param name="restaurant"></param>
            public void SaveRestaurant(Restaurant restaurant)
            {
                restaurant.RunId = RunId;
                restaurant.RunDate = RunDate;
                restaurant.InsertUpdateDate = DateTime.Now;
                RepositoryHelper.CreateEntity(restaurant);
            }
    Saving data

    Phase 4: Removing large amounts of duplicated code

    Abstracted, every business task consists of just three parts: download, match, and save. We therefore extracted a large set of shared helper methods, which makes each task simpler to code and easier to maintain.
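
    The three-step decomposition can be sketched as a template-method base class. The names below are hypothetical, not the project's actual base classes; the point is that the shared skeleton lives in one place and each per-site job only supplies the three steps:

```csharp
using System;

// Illustrative sketch of the "download / match / save" decomposition.
// Shared concerns (step ordering, and in the real project retries,
// throttling, logging) live once in the base class.
public abstract class ScrapeJobBase
{
    protected abstract string Download(string url);  // step 1: fetch the page
    protected abstract object Match(string page);    // step 2: extract the data
    protected abstract void Save(object data);       // step 3: persist the data

    // Template method: the fixed skeleton every job runs through.
    public void Run(string url)
    {
        string page = Download(url);
        object data = Match(page);
        Save(data);
    }
}
```

    With this shape, adding a new site means writing three small overrides rather than copying an entire job.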

    Phase 5: Changing the matching approach

    The project originally matched data purely with regexes. Some sites have fairly complex markup, which made the regexes balloon in size. Worse, even a tiny change to a site would often leave the whole regex matching nothing at all, so the regexes were hard to maintain.

    My first idea was to wrap the regexes in a tree structure and split them up to divide and conquer. After a trial run, maintaining and debugging a tree of regexes turned out to be a nightmare, so I gave it up. Still, I felt the underlying idea was right: keep splitting the page into smaller pieces before matching, because smaller pieces are easier to maintain. While thinking and searching, I discovered HtmlSelector: use HtmlSelector for DOM selection, then use small regexes to match the details. This gradually evolved into the current design; a complete example follows.
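
    The "select first, match later" idea can be illustrated with a minimal stand-in. Here a crude substring lookup plays the role of the DOM selector so the sketch stays self-contained; the `SelectThenMatch` helper is an assumption for illustration, not the project's HtmlSelector:

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative sketch: first narrow the page down to one fragment (a real
// implementation would use a DOM/CSS selector), then run a *small* regex
// against that fragment only, so a site change breaks one tiny pattern at most.
public static class SelectThenMatch
{
    public static string Extract(string page, string cssClass, Regex detail, string group)
    {
        // Stand-in for the selector step: locate the element carrying the class.
        int start = page.IndexOf("class=\"" + cssClass + "\"", StringComparison.Ordinal);
        if (start < 0) return null;

        // Regex step: match the detail inside the selected fragment only.
        Match m = detail.Match(page.Substring(start));
        return m.Success ? m.Groups[group].Value : null;
    }
}
```

    Each detail regex stays a one-liner, and when a site changes only the affected fragment's pattern needs updating.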

    using System;
    using System.Collections.Generic;
    using System.Text;
    using Majestic.Bot.Core;
    using System.Diagnostics;
    using Majestic.Util;
    using Majestic.Entity.Shared;
    using Majestic.Entity.ECommerce.Hungryhouse;
    using Majestic.Dal.ECommerce;
    
    namespace Majestic.Bot.Job.ECommerce
    {
        public class Hungryhouse : JobRequestBase
        {
            private static string proxy;
            private static string userAgent;
            private static string domainUrl = "https://hungryhouse.co.uk/";
            private static string locationUrl = "https://hungryhouse.co.uk/takeaway";
            private int maxRetries;
            private int maxHourlyPageView;
            private HttpManager httpManager = null;
            private int pageCrawled = 0;
    
            /// <summary>
            /// This method needs to be defined here primarily because we want to use the
            /// top-level class name as the logger name, so that the base class logs using
            /// the logger defined by the derived class, not by the base class itself.
            /// </summary>
            /// <param name="row"></param>
            public override void Init(Majestic.Dal.Shared.DSMaj.Maj_vwJobDetailsRow row)
            {
                StackFrame frame = new StackFrame(0, false);
    
                base.Init(frame.GetMethod().DeclaringType.FullName, row);
            }
    
            /// <summary>
            /// Initializes the fields
            /// </summary>
            private void Initialize()
            {
                try
                {
                    JobSettingCollection jobSettingCollection = base.GetJobSettings(JobId);
                    proxy = jobSettingCollection.GetValue("proxy");
                    userAgent = jobSettingCollection.GetValue("userAgent");
                    maxRetries = jobSettingCollection.GetValue<int>("maxRetryTime", 3);
                    maxHourlyPageView = jobSettingCollection.GetValue<int>("maxHourlyPageView", 4500);
                    InithttpManager();
                    InitPattern();
                }
                catch (Exception ex)
                {
                    throw new MajException("Error initializing job " + m_sConfig, ex);
                }
            }
    
            /// <summary>
            /// Initialize the httpManager instance 
            /// </summary>
            private void InithttpManager()
            {
                if (String.IsNullOrEmpty(proxy) || proxy.Equals("none"))
                {
                    throw new Exception("proxy was not set! job ended!");
                }
    
                httpManager = new HttpManager(proxy, this.maxHourlyPageView,
                    delegate(string page)
                    {
                        if (page.Contains("macys.com is temporarily closed for scheduled site improvements"))
                        {
                            return false;
                        }
                        else
                        {
                            return ComUtil.CommonValidateFun(page);
                        }
                    },
                    this.maxRetries);
    
                httpManager.SetHeader("Upgrade-Insecure-Requests", "1");
                httpManager.AcceptEncoding = "gzip, deflate, sdch";
                httpManager.AcceptLanguage = "en-US,en;q=0.8";
                httpManager.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    
                if (!string.IsNullOrEmpty(userAgent))
                {
                    httpManager.UserAgent = userAgent;
                }
            }
    
            /// <summary>
            /// InitPattern
            /// </summary>
            private void InitPattern()
            {
                PatternContainerHelper.Load("Hungryhouse.pattern.xml");
            }
    
            /// <summary>
            /// The assembly entry point that controls the internal program flow.
            /// It is called by the Run() function in the base class
            /// <see cref="MajesticReader.Lib.JobBase"/>.
            /// The program flow:
            /// 1. Get the job requests <see cref="MajesticReader.Lib.HitBoxJobRequest"/> based on JobId
            /// 2. For each request, get the input parameters
            /// 3. Retrieve the Html content
            /// 4. Identify and collect data based on the configuration settings for the request
            /// 5. Save collected data
            /// </summary>
            protected override void OnRun()
            {
                try
                {
                    Initialize();
    
                    int jobId = base.JobId;
                    Log.RunId = base.RunId;
    
                    HungryhouseDao.RunId = RunId;
                    HungryhouseDao.RunDate = DateTime.Now;
    
                    //get current job name
                    string jobName = base.GetJobName();
    
                    //Log start time
                    Log.Info("Hungryhouse Started", string.Format(
                        "Job {0} - {1} Started at {2}", jobId, jobName, DateTime.Now));
    
                    CollectLocation();
    
                    //Log end time
                    Log.Info("Hungryhouse Finished", string.Format(
                        "Job {0} - {1} Finished at {2}. {3} pages were crawled",
                        jobId, jobName, DateTime.Now, pageCrawled));
    
                }
                catch (Exception ex)
                {
                    // This should never have happened, so it is "unexpected".
                    Log.Error("Unexpected/Unhandled Error", ex);
                    throw new Exception("Unexpected/Unhandled Error", ex);
                }
            }
    
            /// <summary>
            /// CollectLocation
            /// </summary>
            private void CollectLocation()
            {
                Log.Info("Started Getting Locations", "Started Getting Locations");
                string page = DownloadPage(locationUrl);
                JobData locationData = ExtractData(page, "LocationArea", PatternContainerHelper.ToJobPatternCollection());
                JobDataCollection locationList = locationData.GetList();
                if (locationList.Count == 0)
                {
                    Log.Warn("can not find locations", "can not find locations");
                    return;
                }
                Log.Info("Locations", locationList.Count.ToString());
                foreach (JobData location in locationList)
                {
                    string url = location.GetGroupData("Url").Value;
                    string name = location.GetGroupData("Name").Value;
                    if (string.IsNullOrEmpty(url) || string.IsNullOrEmpty(name))
                    {
                        continue;
                    }
                   url = ComUtil.GetFullUrl(url, domainUrl);
                   CollectRestaurant(name, url);
                }
                Log.Info("Finished Getting Locations", "Finished Getting Locations");
            }
    
            /// <summary>
            /// CollectRestaurant
            /// </summary>
            /// <param name="name"></param>
            /// <param name="url"></param>
            private void CollectRestaurant(string name, string url)
            {
                Log.Info("Started Getting Restaurant", string.Format("Location:{0},Url:{1}",name,url));
                string page = DownloadPage(url);
                JobData restaurantData = ExtractData(page, "RestaurantArea", PatternContainerHelper.ToJobPatternCollection());
                JobDataCollection restaurantList = restaurantData.GetList();
                if (restaurantList.Count == 0)
                {
                    Log.Warn("can not find restaurant", string.Format("Location:{0},Url:{1}", name, url));
                    return;
                }
    
                Log.Info("Restaurants", string.Format("Location:{0},Url:{1}:{2}", name, url, restaurantList.Count));
                foreach (JobData restaurant in restaurantList)
                {
                    string tempUrl = restaurant.GetGroupData("Url").Value;
                    string tempName = restaurant.GetGroupData("Name").Value;
                    if (string.IsNullOrEmpty(tempUrl) || string.IsNullOrEmpty(tempName))
                    {
                        continue;
                    }
                    tempUrl = ComUtil.GetFullUrl(tempUrl, domainUrl);
                    CollectDetail(tempUrl, tempName);
                }
                Log.Info("Finished Getting Restaurant", string.Format("Location:{0},Url:{1}", name, url));
            }
    
            /// <summary>
            /// Collect detail
            /// </summary>
            /// <param name="url"></param>
            /// <param name="name"></param>
            private void CollectDetail(string url,string name)
            {
                string page = DownloadPage(url);
                Restaurant restaurant = new Restaurant();
                restaurant.Name = name;
                restaurant.Url = url;
    
                JobData restaurantDetailData = ExtractData(page, "RestaurantDetailArea", PatternContainerHelper.ToJobPatternCollection());
                restaurant.Address = restaurantDetailData.GetGroupData("Address").Value;
                restaurant.Postcode = restaurantDetailData.GetGroupData("Postcode").Value;
                string minimum = restaurantDetailData.GetGroupData("Minimum").Value;
                if (!string.IsNullOrEmpty(minimum) && minimum.ToLower().Contains("minimum"))
                {
                    restaurant.Minimum = minimum;
                }
    
                try
                {
                    HungryhouseDao.Instance.SaveRestaurant(restaurant);
                }
                catch (Exception ex)
                {
                    Log.Error("Failed to save restaurant",url,ex);
                }
            }
    
            /// <summary>
            /// Downloads pages by taking sleeping time into consideration
            /// </summary>
            /// <param name="url">The url that the page is going to be downloaded from</param>
            /// <returns>The downloaded page from the specified url</returns>
            private string DownloadPage(string url)
            {
                string result = string.Empty;
                result = httpManager.DownloadPage(url);
                pageCrawled++;
                return result;
            }
    
        }
    }
    Demo
    <?xml version="1.0"?>
    <PatternContainer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Patterns>
        
        <!-- LocationArea -->
        <Pattern Name="LocationArea" Description="LocationArea" HtmlSelectorExpression=".CmsRestcatCityLandingLocations">
          <SubPatterns>
            <Pattern Name="Location" Description="Location" IsList="true" Field="Name,Url">
              <Expression>
                <![CDATA[
              <li[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*(?<Name>[^<]*)</a>
              ]]>
              </Expression>
            </Pattern>
          </SubPatterns>
        </Pattern>
    
        <!-- RestaurantArea -->
        <Pattern Name="RestaurantArea" Description="RestaurantArea" HtmlSelectorExpression=".CmsRestcatLanding.CmsRestcatLandingRestaurants.panel.mainRestaurantsList">
          <SubPatterns>
            <Pattern Name="Restaurant" Description="Restaurant" IsList="true" Field="Name,Url">
              <Expression>
                <![CDATA[
              <li[^>]*restaurantItemInfoName[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*<span>\s*(?<Name>[^<]*)</span>
              ]]>
              </Expression>
            </Pattern>
          </SubPatterns>
        </Pattern>
    
        <!-- RestaurantDetailArea -->
        <Pattern Name="RestaurantDetailArea" Description="Restaurant Detail Area">
          <SubPatterns>
            <Pattern Name="Address" Description="Address" Field="Address" HtmlSelectorExpression="span[itemprop=streetAddress]" />
            <Pattern Name="Postcode" Description="Postcode" Field="Postcode" HtmlSelectorExpression="span[itemprop=postalCode]" />
            <Pattern Name="Minimum" Description="Minimum" Field="Minimum">
              <Expression>
                <![CDATA[
                  <div[^>]*orderTypeCond[^>]*>\s*<p>[\s\S]*?<span[^>]*>\s*(?<Minimum>[^<]*)</span>
                ]]>
              </Expression>
            </Pattern>
          </SubPatterns>
        </Pattern>    
    
      </Patterns>
    </PatternContainer>
    The demo's pattern file

     Author: Dynamic-xia

     Blog: http://www.cnblogs.com/dynamic-xia

     Note: this blog is written for learning, research, and sharing. Reposting is welcome, but you must link to the original article in a prominent place on the page.

  • Original article: https://www.cnblogs.com/dynamic-xia/p/5382788.html